detailed account on the most important types, principles, and algorithms. An encyclo-
pedia that covers most of the important molecular descriptors can be found in Ref. [1].
A molecular descriptor is an abstract, in most cases numerical, property of a molec-
ular structure, derived by some algorithm describing a specific aspect of a compound.
There are many ways to define descriptor classes. The most important criterion is
the structural representation used as input. The simplest types are the
zero- and one-dimensional descriptors (0D and 1D) that depend only on the molecular
formula, such as the molecular mass or the numbers of specific elements. The net charge of
a molecule is often regarded as a 1D descriptor. Most descriptors consider the
molecular topology (i.e., the structural formula). These are considered two-dimensional
(2D) descriptors; most graph theory-based descriptors belong to this class. Descriptors that also
regard the spatial structure are defined as three-dimensional (3D). This class includes,
for instance, molecular interaction field (MIF)-based approaches, but also methods
that make use of Euclidean distances. Further descriptor classes that have been
introduced consider, for example, different conformations or molecular dynamics. Their
dimensionality cannot be expressed in a similarly intuitive way; labels such as
four-dimensional (4D) or five-dimensional (5D) are sometimes used for such methods.
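To make the lowest descriptor classes concrete, the following minimal Python sketch derives element counts and the molecular mass from nothing but a molecular formula. The helper names (`element_counts`, `molecular_mass`) and the small table of atomic masses are illustrative assumptions, not part of any particular toolkit, and the parser deliberately handles only simple formulas without parentheses, charges, or isotopes.

```python
import re
from collections import Counter

# Approximate standard atomic weights (g/mol) for a few common elements.
# This small table is an illustrative assumption, not a full periodic table.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def element_counts(formula: str) -> Counter:
    """Count atoms per element in a simple formula such as 'C2H6O'.

    Parentheses, charges, and isotopes are not handled in this sketch.
    """
    counts = Counter()
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def molecular_mass(formula: str) -> float:
    """Sum the atomic masses weighted by the element counts."""
    return sum(ATOMIC_MASS[el] * n for el, n in element_counts(formula).items())


formula = "C2H6O"
print(dict(element_counts(formula)))
print(round(molecular_mass(formula), 3))
```

Because these descriptors see only the formula, ethanol and dimethyl ether (both C2H6O) receive identical values; distinguishing them requires at least the molecular topology, that is, a 2D descriptor.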
4.2 GRAPH DEFINITIONS
Most of the descriptors we will present in this chapter are at least 2D and therefore
make use of the molecular topology. In such approaches, a molecule is often regarded
as a graph annotated with complex properties, often using an unrestricted label alpha-
bet. This flexible definition allows us to apply all kinds of structured data algorithms
based on graphs [2], which also covers feature-reduced molecular graphs.
Definition 4.1: Given a node label alphabet L_V and an edge label alphabet L_E, we
define a directed attributed graph g by the four-tuple g = (V, E, μ, ν), where
• V defines a finite set of nodes
• E ⊆ V × V denotes a set of edges
• μ : V → L_V denotes a node labeling function
• ν : E → L_E denotes an edge labeling function
The set V of nodes can be regarded as a set of node attributes of size |V|.
The set E of edges defines the structure and (edge) density of the graph. A
connection from node u ∈ V to node v ∈ V is formed by e = (u, v), if e ∈ E. A labeling
function allows integrating information on the nodes or edges by using L_V and L_E.
In theory, there is no restriction on the label alphabet. Nevertheless, for practical
reasons the label alphabet is usually restricted to a vector space of limited dimension,
L = R^k, or a discrete set of symbols, L = {s_1, ..., s_k}. Other definitions of labels
might also contain information such as strings, trees, or graphs; because a reduced
alphabet may impose constraints on the application domain, such richer labels allow a
more flexible encoding.
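As an illustration of Definition 4.1, the four-tuple can be sketched as a small Python data structure. The class name `AttributedGraph`, its methods, and the choice to store an undirected chemical bond as two directed edges are assumptions made for this example, not conventions prescribed by the text. The labels here come from discrete alphabets (element symbols for nodes, integer bond orders for edges).

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


@dataclass
class AttributedGraph:
    """Directed attributed graph g = (V, E, mu, nu)."""
    V: Set[Node] = field(default_factory=set)           # finite set of nodes
    E: Set[Edge] = field(default_factory=set)           # subset of V x V
    mu: Dict[Node, object] = field(default_factory=dict)  # node labeling
    nu: Dict[Edge, object] = field(default_factory=dict)  # edge labeling

    def add_node(self, v: Node, label: object) -> None:
        self.V.add(v)
        self.mu[v] = label

    def add_edge(self, u: Node, v: Node, label: object) -> None:
        # An edge is the ordered pair (u, v); both endpoints must be in V.
        if u not in self.V or v not in self.V:
            raise ValueError("both endpoints must be nodes in V")
        self.E.add((u, v))
        self.nu[(u, v)] = label

    def add_bond(self, u: Node, v: Node, label: object) -> None:
        # Assumption of this sketch: a bond has no direction, so it is
        # stored as two directed edges carrying the same label.
        self.add_edge(u, v, label)
        self.add_edge(v, u, label)


# Hydrogen-suppressed ethanol, C-C-O, with bond order as the edge label.
g = AttributedGraph()
for idx, element in enumerate(["C", "C", "O"]):
    g.add_node(idx, element)
g.add_bond(0, 1, 1)
g.add_bond(1, 2, 1)

print(len(g.V), len(g.E))
print(g.mu[2], g.nu[(1, 2)])
```

Replacing the element symbols with numeric vectors (for example, partial charge and atomic mass per atom) would correspond to the vector-space alphabet L = R^k without changing the structure of the graph.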
Although various labeling functions for molecular graphs are possible, a standard
definition is still under discussion (http://blueobelisk.sourceforge.net,
http://opensmiles.org/). Due to differences in chemoinformatics perception