PRINCIPAL COMPONENTS ANALYSIS 301
The fact that principal components analysis starts with correlation coefficients is
important. As we saw in Chapter 22, a number of different similarity coefficients
have been devised for dealing with similarities between cases with different sorts
of variables. Correlation and regression, as we saw in Chapter 15, are built on scatter
plot logic and are most suitable for measurements. If all the variables in a multivariate
dataset are measurements, then looking at the relationships between them by way of
correlation coefficients makes sense. It makes less sense if some of the variables are
ranks or categories. In practice, principal components analysis often does produce
sensible and valid results even when the variable set does not consist purely of mea-
surements. It should not be too surprising that variables that are ranks rather than
true measurements are not especially threatening to principal components analysis.
As we saw in Chapter 16, rank order correlation coefficients are a better tool for
relating ranks than regression and correlation, but a correlation coefficient (r) gives
a decent approximate assessment of the degree of correlation between variables that
are ranks.
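The point about ranks can be checked with a small sketch. The ranks below are invented purely for illustration, and the function names are our own. The sketch assumes that both variables are already ranks with no ties, and it uses Spearman's coefficient as one familiar rank order correlation coefficient. Under those assumptions, Pearson's r calculated directly on the rank values agrees with Spearman's coefficient calculated by the usual difference formula. With tied ranks, or with only one of the two variables ranked, the agreement would be only approximate.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_from_ranks(rank_x, rank_y):
    """Spearman's coefficient by the difference formula (untied ranks only)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks for six cases on two variables (no ties).
rank_a = [1, 2, 3, 4, 5, 6]
rank_b = [2, 1, 4, 3, 6, 5]

r = pearson_r(rank_a, rank_b)
rs = spearman_from_ranks(rank_a, rank_b)
print("Pearson's r on the ranks:", round(r, 4))
print("Spearman's coefficient:  ", round(rs, 4))
```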
Unranked categories are a different proposition. The scatter plot logic of regres-
sion and correlation means that values of 1 and 3 are treated not only as more
different than values of 1 and 2, but also as twice as different. (The difference
between 1 and 3 is 2, and the difference between 1 and 2 is 1.) We faced a very
similar problem in thinking about Euclidean distance in Chapter 22. We can consider,
as we did before, the possibility that the Ixcaquixtla household dataset had a
variable for type of wall construction, and that the categories were wattle-and-daub,
wood-plank, and mud-brick, assigned values of 1, 2, and 3, respectively. It does not
seem at all reasonable to treat 1 and 3 as any more different than 1 and 2, but cor-
relation coefficients (like Euclidean distances) inevitably will do this. This kind of
category variable with multiple unranked categories is truly unsuitable for measure-
ment of relationships with other variables by way of correlation coefficients and thus
is truly unsuitable for principal components analysis. We came to the same conclu-
sion about Euclidean distances, and the same solution discussed there is potentially
applicable in principal components analysis. The three categories of kinds of wall
construction can be reorganized into three separate presence/absence variables.
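A brief sketch may make both the problem and the solution concrete. Everything in it is an assumption made for illustration: the six households, the floor area values, and the variable and function names are invented and are not taken from the Ixcaquixtla dataset. The sketch first shows that two equally arbitrary numberings of the same three categories yield different correlation coefficients with another variable, and then reorganizes the single category variable into three presence/absence variables.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: wall construction and floor area for six households.
walls = ["wattle-and-daub", "mud-brick", "wood-plank",
         "mud-brick", "wattle-and-daub", "wood-plank"]
floor_area = [20.0, 45.0, 30.0, 50.0, 25.0, 28.0]

# Two equally arbitrary ways of numbering the same unranked categories.
coding_a = {"wattle-and-daub": 1, "wood-plank": 2, "mud-brick": 3}
coding_b = {"wattle-and-daub": 1, "wood-plank": 3, "mud-brick": 2}

r_a = pearson_r([coding_a[w] for w in walls], floor_area)
r_b = pearson_r([coding_b[w] for w in walls], floor_area)
print("r with coding A:", round(r_a, 3))
print("r with coding B:", round(r_b, 3))  # differs, though the data are the same


def presence_absence(values, category):
    """Return 1 where a case belongs to the category and 0 where it does not."""
    return [1 if v == category else 0 for v in values]


# One presence/absence variable per category of wall construction.
wall_variables = {cat: presence_absence(walls, cat) for cat in coding_a}
for cat, column in wall_variables.items():
    print(cat, column)
```

Each household scores 1 on exactly one of the three new variables, so no ordering or spacing of the categories is implied by the numbers.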
Category variables with two categories (including presence/absence variables),
of course, are also not the most suitable fodder for regression and correlation. If the
question is simply to assess the strength and significance of the relationship between
a two-category variable and some other variable, we would not choose regression
and correlation. Principal components analysis, however, must begin with correla-
tions, and it turns out that correlations, while providing only a blunt instrument for
assessing the strength of relationships involving two-category variables, can provide
an acceptable rough approximation.
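As a rough illustration of what a correlation coefficient does with such variables, the sketch below calculates r for two presence/absence variables coded 0 and 1. The ten cases are invented, and the names are our own. For comparison it also calculates the phi coefficient from the counts in the corresponding two-by-two table, a measure not otherwise used in this discussion. For variables coded 0 and 1, the two calculations give the same number.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def phi_coefficient(x, y):
    """Phi from the four cell counts of the two-by-two table of 0/1 values."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    row_and_column_totals = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return (n11 * n00 - n10 * n01) / sqrt(row_and_column_totals)


# Hypothetical presence/absence scores for ten cases on two variables.
present_x = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
present_y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

r = pearson_r(present_x, present_y)
phi = phi_coefficient(present_x, present_y)
print("Pearson's r:", round(r, 4))
print("phi:        ", round(phi, 4))
```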
Imagine the scatter plot we would draw to explore the relationship between two
presence/absence variables. Since the values of each of these two variables would be
limited to 0 and 1, there are only four places in a scatter plot where points could fall:
where x = 0 and y = 0 (the origin of the graph at its lower left corner), where x = 1
and y = 1 (the upper right corner), where x = 1 and y = 0 (the lower right corner),
and where x = 0 and y = 1 (the upper left corner). If the two variables are strongly