Normalization of Gene-Expression Microarray Data
normalization, whose goal is to impose on each array the same
empirical distribution of intensities. The distribution of within-
gene averages is usually used as the target or the reference.
Mathematically, the procedure applies a transformation
$F^{-1}(G_i(y))$, where $G_i$ is the cumulative distribution of intensities in
array $i$, and $F$ is the reference distribution. The algorithm
itself is very simple; intensities in each array are first ranked in
increasing order. Each quantile value is then substituted by the
corresponding quantile in the reference distribution. Finally, val-
ues are brought back to the original order. Using only the obser-
vation ranks, the algorithm is able to deal with a nonlinear trend,
and runs quite fast. Where several replicate measurements of the same gene's
intensity are available (e.g., Illumina and Affymetrix), the algorithm
is usually run before summarization, thus exploiting more
information and possibly giving a better estimate of the real
underlying distribution of gene intensities.
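The rank, substitute, and restore steps described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under stated assumptions: the function name `quantile_normalize` is hypothetical, the input is a complete matrix (no missing values) with genes or probes in rows and arrays in columns, ties are broken by position, and the reference distribution is taken to be the sorted within-gene averages mentioned in the text. Other implementations may build the reference differently, for example by averaging the sorted columns.

```python
import numpy as np

def quantile_normalize(x):
    """Sketch of quantile normalization (hypothetical helper).

    x: 2-D array, rows = genes/probes, columns = arrays.
    Assumes no missing values; ties are broken by position.
    """
    x = np.asarray(x, dtype=float)

    # Reference distribution F: sorted within-gene (row) averages.
    # This choice follows the text; it is an assumption of the sketch.
    reference = np.sort(x.mean(axis=1))

    # Step 1: rank the intensities of each array in increasing order.
    order = np.argsort(x, axis=0, kind="stable")

    # Steps 2 and 3: give the k-th smallest value of every array the
    # k-th reference quantile, writing it back at its original position.
    normalized = np.empty_like(x)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

# Small made-up example: 4 genes measured on 3 arrays.
example = np.array([[5.0, 4.0, 3.0],
                    [2.0, 1.0, 4.0],
                    [3.0, 6.0, 6.0],
                    [4.0, 2.0, 8.0]])
normalized_example = quantile_normalize(example)
```

After the call, every column of `normalized_example` contains the same set of values (the reference quantiles), while the within-array ordering of the genes is unchanged.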
Most commonly used normalization procedures use the whole
set of genes, under the assumption that the great majority of
genes are fairly invariant across arrays. Nevertheless, this assump-
tion is often questionable, especially in experiments where a large
variation in expression profiles is expected. To overcome this
problem, the housekeeping-gene approach borrows the idea from
standard laboratory procedures (e.g., Northern blot or quantita-
tive RT-PCR), where an internal control is used for data normal-
ization. It assumes that some (not all) genes are similarly expressed
across arrays, so that they can be used as a reference for the rela-
tive expression levels of other genes. For example, Affymetrix
platforms include a set of control probes of housekeeping genes
(e.g., β-actin, GAPDH and others).
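As a rough illustration of the idea, the sketch below shifts each array so that its housekeeping probes have the same average level on every array. It is not a prescribed procedure: the function name `housekeeping_normalize`, the use of log-scale intensities, the additive per-array offset, and the choice of control rows are all assumptions made for the example.

```python
import numpy as np

def housekeeping_normalize(log_x, control_rows):
    """Sketch of housekeeping-gene normalization (hypothetical helper).

    log_x: 2-D array of log-intensities, rows = genes, columns = arrays.
    control_rows: row indices of the assumed housekeeping genes.
    """
    log_x = np.asarray(log_x, dtype=float)

    # Average log-intensity of the control genes on each array.
    control_mean = log_x[control_rows, :].mean(axis=0)

    # Offset of each array relative to the overall control level.
    offset = control_mean - control_mean.mean()

    # Subtract the per-array offset from every gene on that array.
    return log_x - offset

# Made-up data: rows 0 and 1 play the role of housekeeping genes.
log_intensities = np.array([[8.0, 9.0],
                            [10.0, 11.0],
                            [5.0, 7.0]])
adjusted = housekeeping_normalize(log_intensities, [0, 1])
```

In this toy example the second array reads one log unit higher on both control genes, so the two arrays are shifted by half a unit in opposite directions and the control genes end up with equal averages. The third gene keeps a residual difference of one unit, which is then attributed to expression rather than to the array.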
However, there is a serious concern about the assumption of
invariant expression of the so-called housekeeping genes as they are
often affected by various factors that are not controlled in the
experiment. Also, those genes are usually highly expressed, thus
not representing genes of low intensities. Furthermore, they are
usually a very small subset of the whole array chip, so fluctuations
in their intensities are highly affected by random or systematic
errors. Any normalization based on such a limited number of inter-
nal references would be unreliable. Therefore, normalization based
on housekeeping genes selected a priori is not recommended.
A possible variation of the same framework is to use spiked-in
control spots with genetic material from unrelated species. Again,
several problems arise with such an approach. First, spike-ins are
added into the sample at a different stage of cDNA preparation,
so that intensity levels of spike-ins are subject to less experimental
variation than the naturally expressed transcripts of comparable
abundance. Second, nonspecific hybridization cannot be excluded,
though it might be reduced with careful probe design. Finally, a
3.5. Housekeeping-Gene Normalization