192 Handbook of Chemoinformatics Algorithms
iii. Select the next compound randomly or in the order it is included in the dataset.
iv. Calculate similarity of this compound with compounds already selected.
v. If all similarity values are below the similarity threshold, include this com-
pound into the list of diverse subset of compounds. If at least one similarity
value is above this threshold, do not include this compound in the list.
vi. If no more compounds left, stop. Otherwise, go to step (iii).
If m compounds are selected out of the entire dataset of Mcompounds, the total
number of distances calculated will be less than s = mM/2. If m M, then s
M(M −1)/2, that is, the number of distances calculated is much smaller than the
total number of distances.
Note 1: In steps (i) and (iii), compounds can be selected randomly or in the order
they are included in the dataset. Actually, compounds can be selected randomly if
before each run the dataset is randomized, and then each compound is selected in the
order it is included in the randomized dataset.
Note 2: Different similarity measures can be used in this algorithm. For example,
the Tanimoto coefficient can be used. If some distance measure like Euclidean
distance is used, the higher value means higher dissimilarity, not similarity, so in
step (iv) a compound is added to the list if all distances to compounds already in the
list are above the threshold.
Note 3: Since it is unknown a priori how diverse the compounds included in the
dataset are, the procedure should be performed with different threshold values. For
example, a user wants to select a subset of m compounds. Then the user can perform
the procedure with two significantly different threshold values and see how many
compounds are selected. If m
1
and m
2
are the sizes of subsets selected, and threshold
values were T
1
< T
2
, then it would be logical to select the new value T
3
as
T
3
= T
1
+
m −m
1
m
2
−m
1
(T
2
−T
1
). (6.32)
The process can be repeated until a number of compounds selected will be sufficiently
close to m. Formula 6.32 may not work if m
1
and m
2
are close to each other. In practice,
in the beginning, calculations with the range of m
1
and m
2
values and a relatively large
range of similarity values can be performed. Then, based on the results, more precise
calculations can be performed for narrower ranges of parameters.
Note 4: After selection of the diverse subset, which is many times smaller than
the entire dataset, additional runs can be performed to select compounds similar to
the selected compounds. The threshold for these runs could be the same as that for the
selection of the diverse subset. If the number of compounds in the entire dataset is M,
and the number of compounds selected is m, and m M, then the maximum num-
ber of distances calculated (the number of elements of the distance matrix between
selected and not selected compounds) will be m(M − m), which will be much smaller
than M(M − 1)/2. Still, if the procedure was run a small number of k times, the total
number of distances calculated is kmM/2 +m(M − m) M(M − 1)/2. In this way,
the dataset can be divided into m initial clusters. If some of the M − m compounds
are close to more than one compound of the diverse subset, they can be assigned to a