180 Constructing Phylogenetic Trees
5.2. Tree Construction: Distance Methods – Basics
In constructing a phylogenetic tree, the taxa we wish to relate are usually
ones currently living. We have information, such as DNA sequences, from
the terminal taxa and no information from the ones represented by internal
vertices. Indeed, we do not even know which internal vertices should exist,
because we do not yet know the tree topology.
The first class of methods for constructing phylogenetic trees that we will
discuss consists of distance methods. These attempt to build a tree using information
that we believe describes the total distances between terminal taxa along the
tree.
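To make the phrase "total distance along the tree" concrete, the following minimal Python sketch sums edge lengths along the unique path joining each pair of terminal taxa. The four-taxon tree, its internal vertices u and v, and all edge lengths are hypothetical values chosen only for illustration; they are not fitted to any data in this chapter.

```python
from itertools import combinations

# Hypothetical metric tree given as (vertex, vertex, edge length).
# "u" and "v" are internal vertices; every length here is made up.
EDGES = [
    ("S1", "u", 0.10),
    ("S3", "u", 0.20),
    ("u", "v", 0.30),
    ("S2", "v", 0.15),
    ("S4", "v", 0.25),
]


def build_adjacency(edges):
    """Map each vertex to a list of (neighbor, edge length) pairs."""
    adjacency = {}
    for a, b, length in edges:
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))
    return adjacency


def path_length(adjacency, start, goal):
    """Sum the edge lengths on the unique path from start to goal in a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        vertex, parent, total = stack.pop()
        if vertex == goal:
            return total
        for neighbor, length in adjacency[vertex]:
            if neighbor != parent:  # never walk back along the edge just used
                stack.append((neighbor, vertex, total + length))
    raise ValueError(f"no path from {start} to {goal}")


def leaf_distances(edges):
    """Distances along the tree between every pair of terminal vertices."""
    adjacency = build_adjacency(edges)
    leaves = sorted(v for v, nbrs in adjacency.items() if len(nbrs) == 1)
    return {(a, b): path_length(adjacency, a, b)
            for a, b in combinations(leaves, 2)}


for (a, b), d in leaf_distances(EDGES).items():
    print(f"{a}-{b}: {d:.2f}")
```

A distance method works in the opposite direction: it starts from a table of such pairwise distances and tries to recover a tree, with edge lengths, that would produce them.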
To see how we might obtain these distances, imagine trying to find the
evolutionary relationship of four species: S1, S2, S3, and S4. Choosing a
particular orthologous stretch of DNA from their genomes, we obtain and
align sequences from each. If the Jukes-Cantor model of base substitution
discussed in Chapter 4 seems appropriate for the data, we then compute Jukes-
Cantor distances between each pair of sequences. These are our estimates of
distances along the tree, which we organize in Table 5.2.
Table 5.2. Distances Between Taxa
      S1    S2    S3    S4
S1          .45   .27   .53
S2                .40   .50
S3                      .62
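As a rough sketch of how entries like those in such a table could be computed, the Python code below estimates the proportion p of differing sites between two aligned sequences and applies the Jukes-Cantor distance formula d = -(3/4) ln(1 - (4/3) p). The function names and the toy sequences are hypothetical, and skipping sites that contain a gap or ambiguous base is an assumption of this sketch rather than a rule stated in the text.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: ignore sites with a gap or an ambiguous base.
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - (4/3) p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The logarithm is undefined once p reaches 3/4.
        raise ValueError("sequences too divergent for the Jukes-Cantor formula")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_table(sequences):
    """Jukes-Cantor distance for every pair of named sequences."""
    return {(a, b): jukes_cantor_distance(sequences[a], sequences[b])
            for a, b in combinations(sequences, 2)}


# Hypothetical aligned sequences, far too short for a real analysis.
TOY = {
    "S1": "ACGTTGCAACGT",
    "S2": "ACGTTGCATCGA",
    "S3": "ACGATGCAACGT",
}

for (a, b), d in distance_table(TOY).items():
    print(f"{a}-{b}: {d:.3f}")
```

Because the formula corrects for multiple substitutions at the same site, the computed distance is always at least as large as the raw proportion p of differing sites.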
Depending on the sequence data, we might instead adopt a different model
of base substitution, leading us to use a different distance formula, such as
the Kimura 2-parameter or the log-det distance. Regardless, the distance we
calculate between sequences is believed to be a measure of the amount of
mutation that has occurred. If these distances were an exact measure of the
amount of mutation that occurred, they would match up with the total distances
between terminal taxa in the metric tree we would like to find.
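To illustrate how a different model leads to a different distance formula, here is a minimal sketch of the Kimura 2-parameter distance in its standard form, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P is the proportion of sites showing a transition (A/G or C/T) and Q the proportion showing a transversion. The function name and the handling of gaps are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def kimura_2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: ignore sites with a gap or an ambiguous base.
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        # The logarithms are undefined for very divergent sequences.
        raise ValueError("sequences too divergent for the Kimura formula")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


# Hypothetical sequences with one transition and one transversion.
print(kimura_2p_distance("AAAAAAAAAA", "GAAAAAAAAC"))
```

Unlike the Jukes-Cantor formula, this distance counts transitions and transversions separately, which matters when the two kinds of substitution occur at noticeably different rates.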
We do not really expect to find a tree that this data fits exactly; after all, the
distances are inferred from sequence data and are not expected to be exactly
correct. Moreover, the method of inferring the distances depended on a model
that involved assumptions that are certainly not met in real organisms. We
hope that whatever method we use to construct a tree will not be too sensitive
to these sorts of errors in the distances.
UPGMA. The first method we consider is called the average distance
method, or, more formally, the unweighted pair-group method with arithmetic