534 15. Similarity and Diversity in Chemical Design
15.3.2 The Compound Descriptors
Each compound in the database is characterized by a vector (the descriptor). The
vector can have real or binary elements. There are many ways to formulate these
descriptors so as to reduce the database search time and maximize success in
generation of lead compounds.
Conventionally, each compound i is described by a list of chemical descriptors.
These may reflect molecular composition, such as atom number, atom
connectivity, or number of functional groups (like aromatic or heterocyclic rings,
tertiary aliphatic amines, alcohols, and carboxamides); molecular geometry, such
as number of rotatable bonds; electrostatic properties, such as charge distribution;
and various physicochemical measurements that are important for bioactivity.
These descriptors are currently available from many commercial packages
like Molconn-X and Molconn-Z (Hall Associates Consulting, Quincy, MA).
Descriptors fall into many classes. Examples include:
• 2D descriptors (also called molecular connectivity or topological indices),
  reflecting molecular connectivity and other topological invariants;
• binary descriptors, simpler encoded representations indicating the presence
  or absence of a property, such as whether or not the compound contains at
  least three nitrogen atoms, doubly-bonded nitrogens, or alcohol functional
  groups;
• 3D descriptors, reflecting geometric structural factors like van der Waals
  volume and surface area; and
• electronic descriptors, characterizing the ionization potential, partial atomic
  charges, or electron densities.
See also [8] for further examples.
Binary descriptors allow rapid database analysis using Boolean algebra
operations. The Molconn-X and Molconn-Z programs, for example, generate
topological descriptors based on molecular connectivity indices (e.g., number of
atoms, number of rings, molecular branching paths, atom types, and bond types).
Such descriptors have been found to be a convenient and reasonably successful
approximation for quantifying molecular structure and relating structure to
biological activity (see review in [6]). These descriptors can be used to
characterize compounds in conjunction with other selectivity criteria based on
activity data for a training set (e.g., [322, 582]). The search for the most
appropriate descriptors is an ongoing enterprise, not unlike force-field
development for macromolecules.
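To make the remark about Boolean operations on binary descriptors concrete, the following is a minimal sketch in Python. It assumes each compound is encoded as an integer whose bit k is 1 when property k is present. The property names, the compound names, and the helper functions `encode` and `shared_properties` are hypothetical and chosen only for illustration; they are not taken from Molconn-X, Molconn-Z, or any other package.

```python
# Hypothetical property list: bit k of a descriptor is 1 if property k is present.
PROPERTIES = [
    "at_least_three_nitrogens",
    "doubly_bonded_nitrogen",
    "alcohol_group",
    "aromatic_ring",
]


def encode(present):
    """Pack the set of properties a compound has into one integer bit string."""
    bits = 0
    for k, name in enumerate(PROPERTIES):
        if name in present:
            bits |= 1 << k
    return bits


def shared_properties(a, b):
    """Count the properties two compounds have in common (bitwise AND, then popcount)."""
    return bin(a & b).count("1")


# Invented toy database of three compounds.
database = {
    "compound_A": encode({"alcohol_group", "aromatic_ring"}),
    "compound_B": encode({"at_least_three_nitrogens", "doubly_bonded_nitrogen"}),
    "compound_C": encode({"alcohol_group", "aromatic_ring", "doubly_bonded_nitrogen"}),
}

# Boolean query: keep the compounds that contain every property in the query mask.
query = encode({"alcohol_group", "aromatic_ring"})
hits = sorted(name for name, bits in database.items() if bits & query == query)

print(hits)
print(shared_properties(database["compound_A"], database["compound_C"]))
```

Each query costs one AND and one comparison per compound, which is why such encodings lend themselves to fast screening of large databases.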
The number of these descriptors, m, is on the order of 1000. It is thus
much smaller than n (the number of compounds) but still too large to permit
standard systematic comparisons for the problems that arise.
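A rough count illustrates why systematic comparison becomes expensive. The sketch below assumes that "systematic" means comparing every pair of compounds across all m descriptors, and it uses a hypothetical database size of n = 1,000,000. Neither the all-pairs reading nor that value of n is stated in the text; the function `pairwise_cost` is likewise introduced only for illustration.

```python
def pairwise_cost(n, m):
    """Return (number of compound pairs, number of descriptor-element comparisons)."""
    pairs = n * (n - 1) // 2      # every unordered pair of compounds
    return pairs, pairs * m       # each pair compares m descriptor elements


# n is an assumed, illustrative database size; m = 1000 follows the text.
pairs, element_ops = pairwise_cost(n=1_000_000, m=1000)

print(f"{pairs:.3e} pairs, {element_ops:.3e} element comparisons")
```

Under these assumptions the count is about 5 × 10^11 pairs and 5 × 10^14 element comparisons.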
Let us define the vector $X_i$ associated with compound $i$ to be the row $m$-vector
$\{X_{i1}, X_{i2}, \ldots, X_{im}\}$.
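As an illustration of this definition, the sketch below builds the row vector for each compound by reading its descriptor values in one fixed order, so that component k means the same thing for every compound. The descriptor names, the numerical values, and the helper `descriptor_vector` are hypothetical. Collecting the rows into an n-by-m array is shown only as a convenient way to hold them and is an assumption of this sketch.

```python
# Hypothetical descriptor names; their order fixes the meaning of each component.
DESCRIPTOR_NAMES = ["n_atoms", "n_rings", "n_rotatable_bonds", "has_alcohol"]


def descriptor_vector(values):
    """Return the row m-vector for one compound, with components in a fixed order."""
    return [float(values[name]) for name in DESCRIPTOR_NAMES]


# Invented descriptor values for three compounds (n = 3, m = 4).
compounds = [
    {"n_atoms": 21, "n_rings": 2, "n_rotatable_bonds": 4, "has_alcohol": 1},
    {"n_atoms": 34, "n_rings": 3, "n_rotatable_bonds": 7, "has_alcohol": 0},
    {"n_atoms": 12, "n_rings": 1, "n_rotatable_bonds": 1, "has_alcohol": 1},
]

# One row per compound; X[i][k] is component k of the vector for compound i.
X = [descriptor_vector(c) for c in compounds]
n, m = len(X), len(X[0])

print(n, m)
print(X[0])
```

Real and binary elements can sit in the same vector, as with `has_alcohol` here, which matches the earlier statement that the descriptor vector may have real or binary elements.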