Rapid structure lookup and distributed substructure searches in very large databases

CINF 42

Marc C. Nicklaus, mn1@helix.nih.gov, Laboratory of Medicinal Chemistry, Center for Cancer Research, National Cancer Institute, NIH, DHHS, 376 Boyles Street, Frederick, MD 21702, Markus Sitzmann, sitzmann@helix.nih.gov, Laboratory of Medicinal Chemistry, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, Frederick, MD 21702, Igor V. Filippov, Laboratory of Medicinal Chemistry, SAIC-Frederick, Inc., NCI-Frederick, Frederick, MD 21702, and Wolf-Dietrich Ihlenfeldt, Xemistry GmbH, Auf den Stieden 8, D-35094 Lahntal, Germany.
We present new tools and services developed by the CADD Group, NCI, for searching for structures in very large databases, such as large screening sample collections. One of these tools is a service for very rapid structure lookup that uses InChIs as well as CACTVS hash code-based identifiers. The latter are designed to take into account tautomerism, different resonance structures drawn for charged species, and the presence of additional fragments; they enable fine-tunable yet rapid compound identification and database overlap analyses. We also present a powerful substructure search tool, implemented as a web service, for databases of millions of compounds, using a search engine operating in distributed mode across a Linux cluster. Finally, we present a tool that automatically generates a web interface for searches by substructure and other criteria from a database file, e.g., an SDF. Some of these tools and services are being made publicly available on the CADD Group's web server.
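
As an illustration of identifier-based lookup and overlap analysis, the following minimal Python sketch assumes that every database record already carries precomputed identifier strings at two sensitivity levels. The field names `strict` and `loose`, the record layout, and the identifier values are hypothetical placeholders, not actual InChIs or CACTVS hash codes, and the sketch is not the service's implementation. The `loose` field stands in for an identifier that ignores distinctions such as tautomerism or additional fragments.

```python
from collections import defaultdict


def build_index(records, key):
    """Map each identifier value to the list of record IDs that carry it."""
    index = defaultdict(list)
    for rec in records:
        index[rec[key]].append(rec["id"])
    return index


def overlap(index_a, index_b):
    """Return the identifier values present in both indexes."""
    return set(index_a) & set(index_b)


# Hypothetical records; the identifier strings are placeholders only.
db_a = [
    {"id": "A1", "strict": "s-001", "loose": "t-01"},
    {"id": "A2", "strict": "s-002", "loose": "t-01"},
    {"id": "A3", "strict": "s-003", "loose": "t-02"},
]
db_b = [
    {"id": "B1", "strict": "s-002", "loose": "t-01"},
    {"id": "B2", "strict": "s-004", "loose": "t-03"},
]

# One index per database and per sensitivity level.
strict_a = build_index(db_a, "strict")
strict_b = build_index(db_b, "strict")
loose_a = build_index(db_a, "loose")
loose_b = build_index(db_b, "loose")

# Lookup is a single dictionary access, independent of database size.
hits_for_loose_id = loose_a.get("t-01", [])

# Overlap at each sensitivity level is a set intersection of the keys.
strict_overlap = overlap(strict_a, strict_b)
loose_overlap = overlap(loose_a, loose_b)
```

In this sketch, choosing which identifier field to index is what makes the comparison tunable: the same two databases share fewer or more entries depending on how much structural detail the chosen identifier distinguishes.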