40 Handbook of Chemoinformatics Algorithms
As computers became more powerful and capable of handling much larger character
sets, newer line notations that can encode chemical concepts, describe reactions,
and be stored in relational databases have become prevalent. More than simply
shorthand for molecular formulas, these linear systems are linguistic structures that
can achieve multiple complex chemoinformatics objectives. SMILES, SMARTS (SMiles
ARbitrary Target Specification), and SMIRKS are related chemical languages that
have been used in applications such as virtual screening, molecular graph mining,
evolutionary design of novel compounds, substructure searching, and reaction transforms.
SMILES is a language with a simple vocabulary that includes atom and bond symbols
and a few rules of grammar. SMILES strings can be used as words in other
languages used for storage and retrieval of chemical information such as HTML,
XML, or SQL. A SMILES string for a molecule represents atoms using their elemental
symbols, with aliphatic atoms written in uppercase letters and aromatic atoms in
lowercase letters. Except in special cases, hydrogen atoms are not included. Square
brackets are used to depict elements, such as [Na] for elemental sodium. However,
square brackets may be omitted for elements from the organic subset (B, C, N, O, P,
S, F, Cl, Br, and I), provided the number of hydrogen atoms can be surmised from
the normal valence. Thus, water is represented as O, ammonia as N, and methane
as C. Bonds are represented with - (single), = (double), # (triple), and : (aromatic),
although single and aromatic bonds are usually left out. Simple examples are CC
for ethane, C=C for ethene, C=O for formaldehyde, O=C=O for carbon dioxide,
COC for dimethyl ether, C#N for hydrogen cyanide, CCCO for 1-propanol, and [H][H]
for molecular hydrogen. Some atomic properties may also be specified using square
brackets, for example, charge ([OH-] for the hydroxide ion) and atomic mass for
isotopic specification ([13CH4] for C-13 methane).
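The rule that hydrogen counts are surmised from the normal valence can be illustrated with a small sketch. The Python below is not a full SMILES parser: it assumes an unbranched, acyclic chain of organic-subset atoms with no brackets or aromatic symbols, and it uses only the lowest normal valence of each element (N, P, and S have further allowed valences that are ignored here). The names `NORMAL_VALENCE`, `BOND_ORDER`, `TOKEN`, and `implicit_hydrogens` are hypothetical and chosen for illustration only.

```python
import re

# Lowest "normal" valence per organic-subset element (a simplifying assumption).
NORMAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
                  "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def implicit_hydrogens(smiles):
    """Return [(symbol, hydrogen_count), ...] for an unbranched, acyclic chain."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    atoms = []    # element symbols in the order they are written
    used = []     # sum of bond orders already used by each atom
    pending = 1   # order of the bond to the next atom (single if omitted)
    for tok in tokens:
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
            continue
        if atoms:
            used[-1] += pending   # the bond counts for the previous atom...
            used.append(pending)  # ...and for the atom being added
        else:
            used.append(0)
        atoms.append(tok)
        pending = 1
    # Hydrogens fill whatever valence the written bonds leave unused.
    return [(a, NORMAL_VALENCE[a] - u) for a, u in zip(atoms, used)]
```

For instance, `implicit_hydrogens("O")` gives `[('O', 2)]` (water), `implicit_hydrogens("C#N")` gives `[('C', 1), ('N', 0)]` (hydrogen cyanide), and `implicit_hydrogens("CCCO")` gives `[('C', 3), ('C', 2), ('C', 2), ('O', 1)]`. Branched, cyclic, aromatic, or bracketed input is rejected with a `ValueError` rather than guessed at.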
A SMILES string is constructed by visiting every atom in a molecule once.
A branch is included within parentheses and branches can be nested indefinitely.
For example, isobutane is CC(C)C and isobutyric acid is CC(C)C(=O)O. Ring
structures are treated by breaking one bond per cycle and labeling the two atoms
in the broken bond with a unique integer (cf. Figure 2.1). Thus, C1CCCCC1 is
cyclohexane, c1ccccc1 is benzene, n1ccccc1 is pyridine, C1=CCC1 is cyclobutene,
and C12C3C4C1C5C4C3C25 is cubane in which two atoms have more than one ring
FIGURE 2.1 SMILES strings are constructed by traversing each atom in a molecule once.
Rings are depicted by first breaking a bond and then including an integer after the two atoms
present in the broken bond. The numbering may change with each addition of a ring. The
construction of a SMILES string for cubane is shown. (a) Structure of cubane with the position of
the starting atom marked with a dot; (b) C1CCC1; (c) C12CCC1CC2; (d) C12CCC1C3CCC23;
and (e) C12C3C4C1C5C4C3C25.
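The branch and ring-closure rules can be checked with a short connectivity sketch. The Python below is a simplified illustration, not a complete SMILES reader: it assumes unbracketed organic-subset atoms and single-digit ring labels, and it records only which atoms are bonded, ignoring bond orders. The names `TOKEN` and `smiles_bonds` are hypothetical.

```python
import re

# Atoms (two-letter symbols first), aromatic atoms, bond symbols,
# branch parentheses, and single-digit ring-closure labels.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")


def smiles_bonds(smiles):
    """Return (atoms, bonds); bonds are (i, j) pairs of atom indices."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES for this sketch: {smiles!r}")
    atoms, bonds = [], []
    prev = None        # index of the atom the next atom will bond to
    branch_stack = []  # atoms to return to when a branch closes
    open_rings = {}    # ring label -> index of the atom that opened it
    for tok in tokens:
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok.isdigit():
            if tok in open_rings:
                # Second occurrence of the label: restore the broken bond.
                bonds.append((open_rings.pop(tok), prev))
            else:
                open_rings[tok] = prev
        elif tok in "-=#:":
            continue  # bond order is ignored; only connectivity is kept
        else:
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1))
            prev = len(atoms) - 1
    if open_rings or branch_stack:
        raise ValueError("unclosed ring label or branch")
    return atoms, bonds
```

Applied to the cubane string in panel (e), `smiles_bonds("C12C3C4C1C5C4C3C25")` yields 8 atoms and 12 bonds, with every atom bonded to exactly three others. Seven bonds come from the chain traversal and five from ring closures, which matches the five broken bonds labeled 1 to 5. Because a label is released once its ring is closed, this sketch also accepts strings that reuse a digit for a later ring.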