
Statistics and Data Analysis in Geology - Chapter 6
measurement units. As an extreme example, we might measure three perpendicular axes on a collection of pebbles. If we measure two of the axes in centimeters and the third in millimeters, the third axis will have proportionally ten times more influence on the distance coefficient than either of the other two variables.
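
The effect of mixed units can be checked with a few lines of arithmetic. The sketch below is only an illustration: the two pebbles and their axis lengths are invented, and it assumes a plain Euclidean distance, which may be scaled differently from the distance coefficient defined in this chapter.

```python
import math

def euclidean_distance(a, b):
    """Plain Euclidean distance between two objects (an assumed form of the distance coefficient)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical pebbles: (long, intermediate, short) axes, all in centimeters.
pebble_1_cm = (6.0, 4.0, 2.0)
pebble_2_cm = (5.0, 3.0, 1.0)

def third_axis_in_mm(pebble):
    """Re-express only the third axis in millimeters (1 cm = 10 mm)."""
    return (pebble[0], pebble[1], pebble[2] * 10.0)

pebble_1_mixed = third_axis_in_mm(pebble_1_cm)
pebble_2_mixed = third_axis_in_mm(pebble_2_cm)

# Per-axis differences: every axis differs by 1 cm, but the third is recorded as 10 mm.
axis_differences = [abs(x - y) for x, y in zip(pebble_1_mixed, pebble_2_mixed)]

d_cm = euclidean_distance(pebble_1_cm, pebble_2_cm)
d_mixed = euclidean_distance(pebble_1_mixed, pebble_2_mixed)

print(axis_differences)  # the third difference is ten times the others
print(d_cm, d_mixed)     # the mixed-unit distance is dominated by the third axis
```

Although the pebbles differ by the same physical amount along every axis, the millimeter axis contributes a difference of 10 rather than 1, so it controls the result. Expressing all variables in the same units, or standardizing them, removes this artifact.
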
Other measures of similarity that are less commonly used in the Earth sciences include a wide variety of association coefficients, which are based on binary (presence-absence) variables or a combination of binary and continuous variables. The most popular of these are the simple matching coefficient, Jaccard's coefficient, and Gower's coefficient, all of which are ratios of the presence-absence of properties. They differ primarily in the way that mutual absences (called "negative matches") are considered. Sneath and Sokal (1973) discuss the relative merits of these and other coefficients of association. Probabilistic similarity coefficients are used with binary data and consider the gain or loss of information when objects are combined into clusters. Again, Sneath and Sokal (1973) provide a comprehensive summary.
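
The role of negative matches can be made concrete with a small example. The sketch below uses the standard definitions of the simple matching and Jaccard coefficients; the two presence-absence vectors are invented for illustration, and the function names are our own. Gower's coefficient, which also handles continuous variables, is not shown.

```python
def match_counts(x, y):
    """Count the four possible outcomes for two presence-absence vectors."""
    a = sum(1 for p, q in zip(x, y) if p and q)          # present in both
    b = sum(1 for p, q in zip(x, y) if p and not q)      # present in x only
    c = sum(1 for p, q in zip(x, y) if not p and q)      # present in y only
    d = sum(1 for p, q in zip(x, y) if not p and not q)  # absent in both (negative matches)
    return a, b, c, d

def simple_matching(x, y):
    """Simple matching coefficient: negative matches count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    """Jaccard's coefficient: negative matches are ignored."""
    a, b, c, d = match_counts(x, y)
    if a + b + c == 0:
        return 0.0  # assumed convention when neither object has any property
    return a / (a + b + c)

# Hypothetical presence (1) / absence (0) of eight properties in two objects.
object_1 = [1, 1, 0, 0, 0, 1, 0, 0]
object_2 = [1, 0, 0, 0, 0, 1, 1, 0]

print(simple_matching(object_1, object_2))  # 0.75
print(jaccard(object_1, object_2))          # 0.5
```

The two objects share four mutual absences. Counting them raises the simple matching coefficient to 0.75, whereas Jaccard's coefficient, which discards them, is only 0.5.
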
Computation of a similarity measurement between all possible pairs of objects will result in an n × n symmetrical matrix, C. Any coefficient Cij in the matrix gives the resemblance between objects i and j. The next step is to arrange the objects into a hierarchy so objects with the highest mutual similarity are placed together. Then groups or clusters of objects are associated with other groups which they most closely resemble, and so on until all of the objects have been placed into a complete classification scheme. Many variants of clustering have been developed; a consideration of all of the possible alternative procedures and their relative merits is beyond the scope of this book. Rather, we will discuss one simple clustering technique called the weighted pair-group method with arithmetic averaging, and then point out some useful modifications to this scheme.
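
The construction of the n × n similarity matrix described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular program: the measurements are invented, the function names are our own, and the product-moment correlation is used only as one possible similarity measure.

```python
import math

def correlation(x, y):
    """Product-moment correlation between the measurements of two objects."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def similarity_matrix(objects, measure):
    """Apply a similarity measure to every pair of objects, giving a symmetric n x n matrix."""
    n = len(objects)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Symmetry: the resemblance of i to j equals that of j to i.
            C[i][j] = C[j][i] = measure(objects[i], objects[j])
    return C

# Hypothetical data: three objects, four variables measured on each.
objects = [
    [0.30, 0.25, 0.20, 0.05],
    [0.28, 0.27, 0.18, 0.06],
    [0.10, 0.40, 0.35, 0.02],
]

C = similarity_matrix(objects, correlation)
for row in C:
    print(["%.3f" % value for value in row])
```

Only the upper triangle is actually computed; the lower triangle is filled in by symmetry, and each diagonal element is the similarity of an object with itself.
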
Extensive discussions of hierarchical and other classification techniques are contained in books by Jardine and Sibson (1971), Sneath and Sokal (1973), Hartigan (1975), Aldenderfer and Blashfield (1984), Romesburg (1984), Kaufman and Rousseeuw (1990), Backer (1995), and Gordon (1999). Diskettes containing clustering programs are included in some of these books or are available separately at modest cost. In addition, most personal computer programs for statistical analysis contain modules for hierarchical clustering.
Table 6-8 contains measurements made on six greywacke thin sections, identified as A, B, ..., F. The values represent the average of the apparent maximum diameters of ten randomly chosen grains of quartz, rock fragment, and feldspar and the average of the apparent maximum diameters of ten intergranular pores in each thin section. The table also gives a symmetric matrix of similarities, in the form of "correlation" coefficients calculated between the six thin sections.
The first step in clustering by a pair-group method is to find the mutually highest correlations in the matrix to form the centers of clusters. The highest correlation (disregarding the diagonal element) in each column of the matrix in Table 6-8 is shown in boldface type. Specimens A and B form mutually high pairs, because A most closely resembles B, and B most closely resembles A. C and D also form mutually high pairs. E most closely resembles D, but these two do not form a mutually high pair because D resembles C more than it does E. To qualify as a mutually high pair, coefficients Cij and Cji must be the highest coefficients in their respective columns.
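
The search for mutually high pairs can be sketched in a few lines. The matrix below is hypothetical: its values are not those of Table 6-8 and were chosen only so that the pattern of resemblances matches the one described in the text. The function name is our own.

```python
def mutually_high_pairs(C, labels):
    """Return pairs (i, j) for which Cij and Cji are each the highest off-diagonal value in their columns."""
    n = len(C)
    # For each column j, find the row i (i != j) holding the highest coefficient.
    best = []
    for j in range(n):
        candidates = [i for i in range(n) if i != j]
        best.append(max(candidates, key=lambda i: C[i][j]))
    pairs = []
    for j in range(n):
        i = best[j]
        # A pair is mutual only if each object is the other's closest match.
        if best[i] == j and j < i:
            pairs.append((labels[j], labels[i]))
    return pairs

labels = ["A", "B", "C", "D", "E", "F"]

# Hypothetical symmetric similarity matrix (NOT the values of Table 6-8).
C = [
    [1.00, 0.95, 0.20, 0.25, 0.30, 0.10],
    [0.95, 1.00, 0.15, 0.22, 0.28, 0.12],
    [0.20, 0.15, 1.00, 0.99, 0.80, 0.60],
    [0.25, 0.22, 0.99, 1.00, 0.85, 0.65],
    [0.30, 0.28, 0.80, 0.85, 1.00, 0.70],
    [0.10, 0.12, 0.60, 0.65, 0.70, 1.00],
]

pairs = mutually_high_pairs(C, labels)
print(pairs)  # [('A', 'B'), ('C', 'D')]
```

In this illustrative matrix, E most closely resembles D, but D most closely resembles C, so E and D are not reported as a pair. The same holds for F, whose closest match is E.
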
We can indicate the resemblance between our mutually high pairs in a diagram such as Figure 6-5 a. Object C is connected to D at a level of r = 0.99, indicating