IV.2.4.2. Qualitative Variables and Similarity Indices
For qualitative data (binary, counts), many similarity indices (SI) can be used as
intuitive measures of the closeness between individuals: the Jaccard, Sorensen-Dice, Tanimoto,
Sokal-Michener indices, etc. (Jaccard, 1912; Duarte et al., 1999; Rouvray, 1992). Similarity
indices are less sensitive to null values of the variables, and thus they are useful
in the case of sparse data. To evaluate the similarity between two individuals X1 and X2, three
or four essential elements are needed: a = number of shared characteristics; b = number of
characteristics present in X1 and absent in X2; c = number of characteristics present in X2 and
absent in X1; d = number of characteristics absent in both X1 and X2 (required for some SI).
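To make these four counts concrete, the following minimal sketch (in Python, with hypothetical helper names) computes a, b, c, d for two binary profiles and derives the Jaccard, Sorensen-Dice and Sokal-Michener indices; for binary data the Tanimoto coefficient coincides with the Jaccard index.

```python
def match_counts(x1, x2):
    """Return (a, b, c, d) for two equal-length binary (0/1) profiles."""
    a = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 1)  # shared characteristics
    b = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 0)  # present in X1 only
    c = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 1)  # present in X2 only
    d = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 0)  # absent in both
    return a, b, c, d

def jaccard(a, b, c, d):
    return a / (a + b + c)            # ignores double absences (d)

def sorensen_dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)    # gives double weight to shared presences

def sokal_michener(a, b, c, d):
    return (a + d) / (a + b + c + d)  # simple matching; counts double absences

# Two illustrative (made-up) binary profiles of 10 characteristics
x1 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
x2 = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
a, b, c, d = match_counts(x1, x2)
print(jaccard(a, b, c, d), sorensen_dice(a, b, c, d), sokal_michener(a, b, c, d))
```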
The different SI can be converted into a dissimilarity D according to the following formulas:
- D = 1 – SI if SI ∈ [0, 1]
- D = (1 – SI) / 2 if SI ∈ [-1, 1]
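These two conversion rules can be written as a small helper (the function name is a hypothetical choice):

```python
def si_to_dissimilarity(si, signed=False):
    """Convert a similarity index SI into a dissimilarity D.

    signed=False : SI in [0, 1]  -> D = 1 - SI
    signed=True  : SI in [-1, 1] -> D = (1 - SI) / 2
    """
    return (1 - si) / 2 if signed else 1 - si
```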
To illustrate the concept of a similarity index, consider a numerical example concerning
three metabolic profiles characterized by 10 metabolites whose concentrations are not
known (Figure 46). In such a case, quantitative data (concentrations) are not available and,
consequently, distances cannot be computed. However, information on the presence/absence of
metabolites j in the different profiles Xi can be used to calculate SI between the profiles.
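Since the data of Figure 46 are not reproduced here, the sketch below uses a hypothetical presence/absence matrix for three profiles and 10 metabolites to show how pairwise Jaccard SI, and the corresponding dissimilarities D = 1 – SI, would be obtained.

```python
from itertools import combinations

# Hypothetical presence/absence data (NOT the data of Figure 46)
profiles = {
    "X1": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
    "X2": [1, 0, 0, 1, 1, 0, 1, 0, 0, 0],
    "X3": [0, 1, 1, 0, 1, 1, 0, 1, 1, 0],
}

for (n1, x1), (n2, x2) in combinations(profiles.items(), 2):
    a = sum(u and v for u, v in zip(x1, x2))          # present in both
    b = sum(u and not v for u, v in zip(x1, x2))      # present in X1 only
    c = sum(v and not u for u, v in zip(x1, x2))      # present in X2 only
    si = a / (a + b + c)                               # Jaccard similarity
    print(n1, n2, "SI =", round(si, 2), "D =", round(1 - si, 2))
```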
IV.2.5. Clustering Techniques
After computation of distances or dissimilarities between all the individuals of the dataset
(e.g. metabolic profiles), it becomes possible to merge them into homogeneous and well-
separated groups by using an aggregation algorithm: initially, the closest (least distant)
individuals are merged to form a group. Once a few small groups have appeared, the
next step consists in merging the most similar groups into larger groups according
to a certain homogeneity criterion (aggregation rule). This procedure is applied iteratively
until all the individuals/groups are merged into one entity; the most separated
(dissimilar) groups are merged at the final step of the clustering procedure. This leads to a
hierarchical stratification of the whole population into homogeneous and well-separated
groups (called clusters).
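As a minimal sketch of this iterative principle, the loop below (pure Python, single linkage assumed, toy 4×4 dissimilarity matrix) repeatedly merges the two closest groups and records the dissimilarity at which each merge occurs, until only one entity remains.

```python
def agglomerate(D, labels):
    """D: symmetric dissimilarity matrix (list of lists); labels: item names."""
    clusters = [[i] for i in range(len(labels))]
    history = []
    while len(clusters) > 1:
        # find the pair of groups with the smallest inter-group dissimilarity
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(D[p][q] for p in clusters[i] for q in clusters[j])  # single linkage
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append(([labels[k] for k in merged], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history  # merges appear in order of increasing dissimilarity

D = [[0.0, 0.2, 0.7, 0.8],
     [0.2, 0.0, 0.6, 0.9],
     [0.7, 0.6, 0.0, 0.3],
     [0.8, 0.9, 0.3, 0.0]]
print(agglomerate(D, ["X1", "X2", "X3", "X4"]))
```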
For the clustering procedure, several aggregation algorithms are available, based on
different homogeneity criteria. Two clustering principles will be illustrated here: distance-
based (a) and variance-based (b) clustering. The distance-based clustering will be illustrated
by four algorithms (single, average, centroid and complete linkage) (Figure 48), whereas the
variance-based clustering will be illustrated by one method (Ward's method, or second-order
moment algorithm) (Figure 47) (Ward, 1963; Everitt, 2001; Gordon, 1999; Arabie, 1996).
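For readers who wish to experiment, the sketch below applies these five aggregation rules with SciPy's hierarchical clustering routines to a small, hypothetical quantitative dataset (not the data of Figures 47 and 48).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two loose groups of 5 hypothetical profiles described by 3 variables
X = np.vstack([rng.normal(0, 1, (5, 3)), rng.normal(4, 1, (5, 3))])
d = pdist(X, metric="euclidean")  # condensed distance vector

for method in ("single", "average", "centroid", "complete", "ward"):
    # centroid and ward are only well defined for Euclidean distances
    Z = linkage(d, method=method)
    groups = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, groups)
```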