Building and using an in-house platform for data mining and analysis integrating open source and proprietary software: I. Designing and constructing the framework

CINF 28

Erik Evensen, ee@sunesis.com, Hans E. Purkey, hpurkey@sunesis.com, Ken Lind, and Erin K. Bradley, ebradley@sunesis.com. Computational Sciences, Sunesis Pharmaceuticals Inc, 341 Oyster Point Blvd., South San Francisco, CA 94080
A common problem faced by computational chemists is integrating and transferring data among numerous and disparate systems. This often involves managing and translating multiple flat files, an approach that scales poorly to complex workflows with large data sets. We have constructed a database-backed platform utilizing open source software, primarily MySQL and Python, that enables building complicated data management and analysis processes incorporating data generated by both open and closed source software. In addition, we have developed internal protocols based on open standards such as XML-RPC to make computational results available both within and outside of our platform. By using well-known, open standards, we are able to leverage widely available knowledge and experience. We will present lessons learned during the development of this platform.
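
The general pattern described above, a database-backed store whose results are exposed over XML-RPC, might be sketched as follows. This is only an illustration under stated assumptions, not the platform's actual code: the standard-library sqlite3 module stands in for MySQL so the example is self-contained, and the table layout, the compound identifiers, and the get_results method name are all hypothetical.

```python
import sqlite3
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Serializes access to the shared connection across threads.
_db_lock = threading.Lock()


def build_store():
    """Create an in-memory results table (sqlite3 stand-in for MySQL)."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    # Hypothetical schema: one row per computed property of a compound.
    conn.execute(
        "CREATE TABLE results (compound_id TEXT, property TEXT, value REAL)"
    )
    return conn


def load_results(conn, rows):
    """Insert (compound_id, property, value) tuples from any upstream tool."""
    with _db_lock:
        conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
        conn.commit()


def get_results(conn, compound_id):
    """Return all stored properties for one compound as a dict."""
    with _db_lock:
        cur = conn.execute(
            "SELECT property, value FROM results WHERE compound_id = ?",
            (compound_id,),
        )
        return {prop: val for prop, val in cur.fetchall()}


def serve(conn, host="127.0.0.1", port=0):
    """Expose get_results over XML-RPC in a background thread.

    port=0 asks the operating system for any free port.
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(
        lambda compound_id: get_results(conn, compound_id), "get_results"
    )
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    conn = build_store()
    # Hypothetical values, for illustration only.
    load_results(conn, [("CMPD-1", "logP", 2.5), ("CMPD-1", "mw", 310.0)])
    server = serve(conn)
    host, port = server.server_address
    client = ServerProxy(f"http://{host}:{port}")
    print(client.get_results("CMPD-1"))
    server.shutdown()
```

Because XML-RPC clients exist for most languages, a consumer of these results would not need to know anything about the underlying schema, which is one plausible reading of the appeal of open standards noted in the abstract.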