Nabil Semmar 74
The new coordinates resulting from the row and column analyses condense the variability of
the initial dataset into a low-dimensional space (fewer than p dimensions) made up of
independent directions called factors. The factors account for successively smaller shares of
variability because they correspond to decreasing eigenvalues; this makes it possible to
describe the variability of the initial dataset in a minimal-dimension space spanned by the
first factors (Escofier and Pagès, 1991): the first factor (F1) describes the largest part of the
total variability, the second (F2) describes the largest part of the remaining variability not
described by F1, and so on. The variability of the dataset is thus rapidly condensed into a
small number of dimensions. This is particularly useful for large datasets, which is generally
the case in metabolomics.
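The decay of the eigenvalues can be checked numerically. The sketch below is a minimal illustration, assuming the standard SVD formulation of CA on standardized residuals, which may differ in detail from the computation shown in the figures. The 5 x 4 table and the function name `ca_inertia` are hypothetical and serve only as an example.

```python
import numpy as np

def ca_inertia(table):
    """Return CA eigenvalues, their shares of total variability, and cumulative shares.

    Assumption: CA is computed from the SVD of the standardized residuals
    of the correspondence matrix (a common formulation, not taken from the source).
    """
    X = np.asarray(table, dtype=float)
    P = X / X.sum()                      # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # frequencies expected under independence
    S = (P - expected) / np.sqrt(expected)
    singular_values = np.linalg.svd(S, compute_uv=False)  # sorted in decreasing order
    # The last axis is structurally null, so at most min(n, p) - 1 factors remain.
    eigvals = singular_values[: min(X.shape) - 1] ** 2
    share = eigvals / eigvals.sum()
    return eigvals, share, np.cumsum(share)

# Hypothetical table: 5 individuals (rows) x 4 variables (columns).
X = np.array([
    [20,  5,  3,  2],
    [18,  6,  4,  2],
    [ 4, 15,  9,  2],
    [ 3,  4,  5, 18],
    [10, 10, 10, 10],
])

eigvals, share, cumulative = ca_inertia(X)
for k, (ev, s, cs) in enumerate(zip(eigvals, share, cumulative), start=1):
    print(f"F{k}: eigenvalue={ev:.4f}  share={s:.1%}  cumulative={cs:.1%}")
```

Reading the cumulative shares from F1 onward indicates how many factors are needed to reach an acceptable proportion of the total variability.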
The computation of factorial coordinates is illustrated by a numerical example based on
the previous dataset (Figure 55) (Figures 62, 63). After the factorial coordinates of the rows
have been calculated along each factor, their signs must agree with those of the coordinates
of the corresponding column eigenvector. For instance, along F1, the column eigenvector is
V1, with five coordinates (0.58, -0.12, 0.07, -0.17, -0.78) (Figure 63), whereas the
calculation of the factorial coordinates of the five rows along F1 gives (-0.59, 0.27, -0.14,
0.24, 1.44). As the two sets have opposite signs, one of them must be multiplied by -1 to
obtain an appropriate superimposition of rows and columns; thus F1 becomes (0.59, -0.27,
0.14, -0.24, -1.44) (Figure 62). Depending on the dataset, such a sign correction may or may
not be necessary.
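The sign correction of this example can be reproduced in a few lines. The sketch below uses the values quoted above; the function name `align_sign` and the decision rule (flip when the dot product of the two vectors is negative) are assumptions made for illustration, not a rule stated in the source. The rule applies here only because the example has as many rows as columns.

```python
import numpy as np

def align_sign(row_coords, column_vector):
    """Flip the row coordinates if they point opposite to the column eigenvector.

    Assumption: agreement is judged by the sign of the dot product, which
    requires both vectors to have the same length (true in this 5 x 5 example).
    """
    row_coords = np.asarray(row_coords, dtype=float)
    column_vector = np.asarray(column_vector, dtype=float)
    if np.dot(row_coords, column_vector) < 0:
        return -row_coords   # opposite orientation: multiply by -1
    return row_coords        # already consistent: leave unchanged

v1 = [0.58, -0.12, 0.07, -0.17, -0.78]      # column eigenvector along F1
f1_raw = [-0.59, 0.27, -0.14, 0.24, 1.44]   # row coordinates before correction

f1 = align_sign(f1_raw, v1)
print(f1)
```

Applying the function a second time leaves the coordinates unchanged, which reflects the fact that the correction is only needed for some datasets.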
To measure the distance between two row-profiles or two column-profiles, CA uses the
chi-square distance. The distance between two row profiles (e.g. two patients) i and i’ is given
by (Escofier and Pagès, 1991; Greenacre, 1984; 1993):
$$d^2(i, i') = \sum_{j=1}^{p} \frac{x_{++}}{x_{+j}} \left( \frac{x_{ij}}{x_{i+}} - \frac{x_{i'j}}{x_{i'+}} \right)^2$$
(eq. 5),
where $x_{++}$ is the grand total of the whole dataset, $x_{i+}$ and $x_{i'+}$ are the sums of
rows i and i’, respectively, and $x_{+j}$ is the sum of column j.
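As a minimal sketch of eq. 5, the function below computes the squared chi-square distance between two row profiles with NumPy. The table and the name `chi2_row_distance_sq` are hypothetical. Rows 0 and 1 are deliberately proportional (same profile, different absolute levels), whereas row 2 has the reverse profile.

```python
import numpy as np

def chi2_row_distance_sq(table, i, i2):
    """Squared chi-square distance between the profiles of rows i and i2 (eq. 5)."""
    X = np.asarray(table, dtype=float)
    total = X.sum()               # x_{++}: grand total
    row_sums = X.sum(axis=1)      # x_{i+}: row totals
    col_sums = X.sum(axis=0)      # x_{+j}: column totals
    # Difference between the two row profiles (each row divided by its own total).
    diff = X[i] / row_sums[i] - X[i2] / row_sums[i2]
    # Each squared difference is weighted by x_{++} / x_{+j}.
    return float(np.sum(total / col_sums * diff ** 2))

# Hypothetical table: 3 individuals x 3 variables.
X = np.array([
    [10, 20, 30],
    [ 1,  2,  3],   # proportional to row 0
    [30, 20, 10],   # reversed profile
])

d_01 = chi2_row_distance_sq(X, 0, 1)
d_02 = chi2_row_distance_sq(X, 0, 2)
print(d_01, d_02)
```

The distance between rows 0 and 1 vanishes although their absolute values differ tenfold, because only the relative values within each row enter the formula.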
The chi-square distance between two row profiles is low when the profiles show similar
relative values across the variables, independently of their absolute values (Figure 45).
Similarly, the distance between two column profiles (e.g. two metabolite variables) j and j’
is given by:
$$d^2(j, j') = \sum_{i=1}^{n} \frac{x_{++}}{x_{i+}} \left( \frac{x_{ij}}{x_{+j}} - \frac{x_{ij'}}{x_{+j'}} \right)^2$$
(eq. 6)
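Eq. 6 mirrors eq. 5 with the roles of rows and columns exchanged. The sketch below is a minimal illustration under that reading; the table and the name `chi2_col_distance_sq` are hypothetical. Columns 0 and 1 are proportional, and column 2 has a different profile.

```python
import numpy as np

def chi2_col_distance_sq(table, j, j2):
    """Squared chi-square distance between the profiles of columns j and j2 (eq. 6)."""
    X = np.asarray(table, dtype=float)
    total = X.sum()               # x_{++}: grand total
    row_sums = X.sum(axis=1)      # x_{i+}: row totals
    col_sums = X.sum(axis=0)      # x_{+j}: column totals
    # Difference between the two column profiles (each column divided by its own total).
    diff = X[:, j] / col_sums[j] - X[:, j2] / col_sums[j2]
    # Each squared difference is weighted by x_{++} / x_{i+}.
    return float(np.sum(total / row_sums * diff ** 2))

# Hypothetical table: 3 individuals x 3 variables.
X = np.array([
    [10, 20,  5],
    [30, 60,  5],
    [20, 40, 30],
])

d_01 = chi2_col_distance_sq(X, 0, 1)   # column 1 is twice column 0
d_02 = chi2_col_distance_sq(X, 0, 2)
print(d_01, d_02)
```

As with row profiles, two variables that vary in the same proportions across individuals are at zero distance, whatever their overall levels.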
V.6.3.3. Graphical Interpretation of CA Results and Outlier Diagnostic
Plotting the factorial coordinates of the rows shows how atypical or ordinary each
individual is within the population. Moreover, the scatter plot of the factorial coordinates of
the columns helps to identify which variables are associated with the atypical individuals: an
individual that projects close to a variable has a high value for that variable relative to all
the individuals and variables of the dataset. Graphically, outliers can be highlighted by extreme points along the