Molecular Descriptors 109
4.4.2.4 Fingal
The Fingerprinting Algorithm (FingAl) published by Brown et al. [44] is an extension
of the path-based hashed fingerprintsin which geometrical information is incorporated
into the paths. This is achieved by an adaption of the atom types that are used in the
paths. Each atom type on the path is augmented with its geometrical distance to
the previous atoms of the path. The distances are binned into a set of 10 distance
classes because it is unlikely that the distances are exactly identical for two paths for
a molecular comparison regarding the amount of identical substructures (paths). In
the original publication, the boundaries of the bins are {2.35, 2.71, 3.69, 4.47, 5.25,
6.39, 7.59, 9.32, 12.44}. This leads to a better discrimination of the molecules but
introduces a bias caused by the conformation. In the original work the –3D structure
is computed by CORINA [47].
4.4.3 HASHING
Hashing denotes the mapping of the element of a space χ, which is not necessar-
ily numeric, to an integer (often restricted to a specific interval). The mapping is
conducted using a deterministic hash function h : χ → Z. An ideal hash function is
injective, which means that the hashes of two inputs are equal only if the inputs are
equal. Although this is a preferable property, most hash functions are not injective
because the input space is larger than the integer set it is mapped to.
Many hashing algorithms work on bit representations of data objects. Each object
is considered as a stream of bits that are subsequently hashed (mostly blockwise) into
other bit sets.
4.4.3.1 Cyclic Redundancy Check
The cyclic redundancy check (CRC) published by Peterson and Brown [48] is a
function that maps a sequence of bit inputs to an integer (its binary represen-
tation) in a specific interval. The key idea is to define a generator polynomial
p(x) =
k
i=0
c
i
x
i
with coefficients c ∈{0, 1} and degree k which can be expressed
as a bit sequence b of k +1 bits, such that b
i
= c
i
. Analogously, the input bit
sequence is subdivided into overlapping blocks of length k +1 and extended with
zeros if necessary. The hash value results from a blockwise modulo 2 polynomial
division of the input sequence by the generator polynomial. Frequently used genera-
tor polynomials are CRC − 32 = x
32
+x
26
+x
23
+x
22
+x
16
+x
12
+x
11
+x
10
+
x
8
+x
7
+x
5
+x
4
+x
2
+x
1
+x
0
or CRC − 16 = x
16
+x
12
+x
5
+x
0
. An outline
of the algorithm is given in Algorithm 4.7.
ALGORITHM 4.7 PSEUDOCODE FOR THE CRC HASH FUNCTION
(ADAPTED FROM WIKIPEDIA)
method getCRCvalue(InputBitSet B, GeneratorBitSet G)
{
ShiftingRegister R := {00000....} //
Initialization
while B is not finished do