354 Handbook of Chemoinformatics Algorithms
Exposing the chemoinformatics functionality of the underlying toolkit within a
database is not particularly difficult and simply requires that one conform to the
prescribed database API. On the other hand, the representation used to store molecules
can significantly affect query efficiency. Thus, for example, one can store a molecule
as a SMILES string—indeed this is probably the most platform-independent way of
storing it. However, during a query, each SMILES string must be parsed and an internal
molecule object must be created. Invariably, this object is not retained for future
queries, so the parsing cost is paid again each time.
To alleviate this, one can generate binary representations of a molecule and store them
in a column of appropriate type (such as bytea in PostgreSQL). In this scenario, the
binary form may simply represent a serialized form of an internal molecule object.
Given that deserialization can be much faster than parsing, this representation can
lead to improvements in query efficiency. Both Tigress and RDKit currently support
binary molecular representations.
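The trade-off described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not how any particular cartridge works: `toy_parse` is a hypothetical stand-in for a real SMILES parser, the "molecule object" is a plain dictionary, `pickle` stands in for a toolkit's binary serializer, and an SQLite BLOB column plays the role of a bytea column in PostgreSQL.

```python
import pickle
import sqlite3


def toy_parse(text):
    """Hypothetical stand-in for a toolkit parser (NOT a real SMILES parser).

    Treats each character as an atom in an unbranched chain, purely to
    represent the step that turns text into an in-memory molecule object.
    """
    atoms = list(text)
    bonds = [(i, i + 1) for i in range(len(atoms) - 1)]
    return {"atoms": atoms, "bonds": bonds}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE molecules (id INTEGER PRIMARY KEY, smiles TEXT, mol_blob BLOB)"
)

# Registration time: parse once, then store the text and the serialized object.
mol = toy_parse("CCO")
conn.execute(
    "INSERT INTO molecules (smiles, mol_blob) VALUES (?, ?)",
    ("CCO", pickle.dumps(mol)),
)

# Query time: deserialize the stored bytes instead of re-parsing the text.
(blob,) = conn.execute(
    "SELECT mol_blob FROM molecules WHERE smiles = ?", ("CCO",)
).fetchone()
restored = pickle.loads(blob)  # only safe for data the database itself wrote
```

In a real deployment the serializer would be the toolkit's own binary format, and whether deserialization actually beats parsing should be measured for the toolkit and molecule sizes in question.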
The use of chemoinformatics cartridges can result in cleaner and more efficient
chemical information infrastructures by virtue of moving complexity away from
frontends (or clients) into the backend database.
12.5.2 INDEXING CHEMICAL INFORMATION
As noted above, chemical information databases will hold chemical structures in addition
to traditional data types (text, numeric, dates, etc.). Furthermore, the traditional
data types will usually represent some properties of the molecules. Examples might
include molecular descriptors, fingerprints, assay readouts, and so on. Given these
varied data types, efficient indexing plays an important role in allowing fast queries.
A key factor in the choice of indexing scheme is the intended query. Thus,
for example, if one were simply retrieving records based on a textual compound ID,
a single B-tree [32] index on the relevant column would provide a time complexity
of O(log n) for searches. On the other hand, similarity searches require an indexing
scheme that is capable of performing efficient near-neighbor searches, in possibly
highly multidimensional spaces.
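The compound ID case can be illustrated with a minimal sketch using Python's built-in `sqlite3` module, whose ordinary indexes are B-trees. The table name, column names, index name, and ID format below are all assumptions made for the example; the query plan is inspected to confirm that the lookup goes through the index rather than scanning the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (compound_id TEXT, smiles TEXT)")

# Populate with hypothetical IDs; the structure column is a placeholder.
conn.executemany(
    "INSERT INTO compounds VALUES (?, ?)",
    [(f"CMPD-{i:06d}", "CCO") for i in range(10_000)],
)

# A single B-tree index on the ID column.
conn.execute("CREATE INDEX idx_compound_id ON compounds (compound_id)")

# Ask SQLite how it would execute the lookup.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT smiles FROM compounds WHERE compound_id = ?",
    ("CMPD-004242",),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last column is the description

# The lookup itself: an equality search served by the index.
row = conn.execute(
    "SELECT smiles FROM compounds WHERE compound_id = ?", ("CMPD-004242",)
).fetchone()
```

The same pattern applies in PostgreSQL, where `CREATE INDEX` also builds a B-tree by default. It does not help with similarity queries, which need a different index structure.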
We first consider how one might employ indexing to provide efficient query times
when searching for chemical structures. Ignoring the trivial case of retrieving structures
based on some textual ID, we focus on how structure and substructure searches
can be improved by an indexing scheme. Searching for exact matches to a query
molecule can benefit from standard hash indexes. Depending on the nature of the
structure representation, this may require some form of canonicalization of the query
molecule (as well as of the stored molecules, possibly at registration time). Thus, for
example, one can store the molecules in a text field using a SMILES representation.
Assuming that they are appropriately canonicalized, one can then identify entries that
exactly match a query molecule by performing a string equality search. If this field is
indexed by a B-tree index, this will be very fast. Given that canonicalization methods
are specific to a given toolkit, a more generalized solution that is independent of any
specific toolkit is to employ InChIs for structure representation. Since InChI is a plain
text format, it provides the same advantages as SMILES. In addition, because there is
only one implementation of the algorithm, the InChIs for two forms of the same molecule
will always be the same.
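The exact-match scheme described above, canonicalizing at registration time and again at query time before a string equality search, might be sketched as follows. The canonicalizer here is a loudly labeled assumption: `toy_canonicalize` is a hypothetical lookup table covering a handful of strings, standing in for a toolkit's canonical SMILES routine or an InChI generator. Table and index names are likewise illustrative.

```python
import sqlite3

# ASSUMPTION: a toy lookup table standing in for a real canonicalizer.
# It only knows the few equivalent spellings listed here.
_TOY_CANONICAL = {
    "CCO": "CCO",
    "OCC": "CCO",
    "C(O)C": "CCO",
    "c1ccccc1": "c1ccccc1",
    "C1=CC=CC=C1": "c1ccccc1",
}


def toy_canonicalize(smiles):
    """Map a known input string to its (toy) canonical form."""
    return _TOY_CANONICAL[smiles]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE structures (id INTEGER PRIMARY KEY, canonical_smiles TEXT)"
)
conn.execute("CREATE INDEX idx_canonical ON structures (canonical_smiles)")


def register(smiles):
    """Canonicalize once at registration time and store the canonical string."""
    conn.execute(
        "INSERT INTO structures (canonical_smiles) VALUES (?)",
        (toy_canonicalize(smiles),),
    )


def exact_match_ids(query_smiles):
    """Canonicalize the query, then run an indexed string equality search."""
    rows = conn.execute(
        "SELECT id FROM structures WHERE canonical_smiles = ? ORDER BY id",
        (toy_canonicalize(query_smiles),),
    )
    return [r[0] for r in rows]


register("OCC")          # ethanol, written oxygen-first
register("C1=CC=CC=C1")  # benzene, written in Kekule form
matches = exact_match_ids("C(O)C")  # ethanol again, written differently
```

The stored and query strings must come from the same canonicalizer for the equality test to be meaningful, which is the motivation for a toolkit-independent identifier such as InChI.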