the matrix indicates the mutability of the corresponding
amino acid, whereas the off-diagonal elements indicate
their exchange probabilities.A neutral (random) score is 0,
whereas an amino acid pair with a score of –3 exchanges
with only 10
3/10
0.50 of the frequency expected at ran-
dom. This substitution matrix has been arranged such that
the amino acid residues most likely to replace each other in
related proteins (the pairs that have the highest log R
ij
val-
ues) are grouped together. Note that this grouping is more
or less what is expected from their physical properties.
Identities (no replacement) tend to have the highest val-
ues in Table 7-7. Trp and Cys (diagonal values 17 and 12)
are the residues least likely to be replaced, whereas Ser,
Ala, and Asn (all 2) are the most readily mutated. The
residue pair least likely to exchange is Cys and Trp (8),
whereas the pair most likely to exchange is Tyr and Phe (7),
although these latter residues are among the least likely to
exchange with other residues (mostly negative entries).
Similarly, charged and polar residues are unlikely to ex-
change with nonpolar residues (entries nearly always neg-
ative).
The confidence that one can align sequences known to
be distantly related has been investigated as a function of
PAM values (N). The PAM-250 log odds substitution ma-
trix tends to yield the best alignments, that is, the highest
alignment scores relative to those derived using substitu-
tion matrices based on larger or smaller PAM values. Note
that Fig. 7-26b indicates that at 250 PAMs, 80% of the
residues in the original polypeptide have been replaced.
e. Sequence Alignment Using the
Needleman–Wunsch Algorithm
The use of a log odds substitution matrix to find an
alignment is straightforward (although tedious). When
comparing two sequences, rather than just making a matrix
with 1’s at all matching positions, one enters the appropri-
ate value in the log odds substitution matrix at every posi-
tion. Such a matrix represents all possible pair combina-
tions of the two sequences. In Fig. 7-30a, we use the
PAM-250 log odds matrix with a 10-residue peptide hori-
zontal and an 11-residue peptide vertical. Thus, the align-
ment of these two peptides must have at least one gap or
overhang, assuming a significant alignment can be found
at all.
An algorithm for finding the best alignment between
two polypeptides (that with the highest log odds value) was
formulated by Saul Needleman and Christian Wunsch.
One starts at the lower right corner (C-termini) of the
matrix, position (M, N) (where in Fig. 7-30a, M 11 and
N 10),and adds its value (here 2) to the value at position
(M 1, N 1) [here 12, so that the value at position (M 1,
N 1), that is, (10, 9), becomes 14 in the transformed
matrix]. Continuing this process in an iterative manner,add
to the value of the element at position (i, j) the maximum
value of the elements (p, j 1), where p i 1, i 2, …,
M, and those of (i 1, q), where q j 1, j 2, …, N.Fig-
ure 7-30b shows this process at an intermediate stage with
the original value of the (6, 5) position in a small box and
the transformed values of the positions (p, 6), where p 7,
8, …, 11, together with positions (7, q), where q 6, 7, …,
10, in the L-shaped box. The maximum value of the matrix
elements in this L-shaped box is 19 and hence this is the
value to add to the value (0) at position (6, 5) to yield the
value 19 in the transformed matrix.This process is iterated,
from the lower right toward the upper left of the matrix,
until all its elements have been so treated, yielding the fully
transformed matrix shown in Fig. 7-30c. The Needle-
man–Wunsch algorithm thereby yields the log odds values
for all possible alignments of the two sequences.
The best alignment (that with the highest log odds
value) is found by tracing the ridgeline of the transformed
matrix (Fig. 7-30c) from its maximum value at or near the
upper left (N-terminus) to that at or near the lower right
(C-terminus). This is because the alignment of a particular
residue pair is independent of the alignment of any other
residue pair, and hence the best score up to any point in an
alignment is the best score through the previous step plus
the incremental score of the new step.This additive scoring
scheme is based on the assumption that mutations at differ-
ent sites are independently accepted, which appears to be
an adequate characterization of protein evolution even
though specific interactions between residues are known to
have critical structural and functional roles in proteins.
The line connecting the aligned residue pairs (those cir-
cled in Fig. 7-29c) must always extend down and to the
right. This is because a move up or to the left, or even
straight down or straight to the right, would imply that a
residue in one peptide aligned with more than one residue
in the other peptide.Any allowed deviation from a move of
(1, 1) implies the presence of a gap.The best alignment
of the two polypeptides, that connected by the lines in Fig.
7-30c, is indicated in Fig. 7-30
d. Note that this alignment is
ambiguous; the alignment of S in the 10-mer with either T
Section 7-4. Bioinformatics: An Introduction 199
CysC
SerS02
ThrT 213
ProP 3106
AlaA 211 12
GlyG 31011 5
AsnN 4101002
AspD 50010124
GluE 500100134
GlnQ 5 1 10011224
HisH 3 1 101 221136
ArgR 40102 301 1126
LysK 50011 21001035
MetM 5 2 1 21 3 2 3 2 1 2006
IleI 2 1021 3 2 2 2 2 2 2 225
LeuL 6 3 2 3 2 4 3 4 3 2 2 3 3426
ValV 2 10101 2 2 2 2 2 2 22424
PheF 4 3 3 54 5 4 6 5 5 2 4 501219
TyrY03 3 5 3 5 2 4 4 404 4 2 1 1 2710
TrpW 8 2 5 66 7 4 7 7 5 323 4 5 2 600
CSTPAGNDEQHRKM ILVFY
Cys Ser Thr Pro Ala Gly Asn Asp Glu Gln His Arg Lys Met Ile Leu Val Phe Tyr
17
W
Trp
12
Table 7-7 The PAM-250 Log Odds Substitution Matrix
Source: Dayhoff, M.O. (Ed.), Atlas of Protein Sequence and Structure, Vol.
5, Supplement 3, p. 352, National Biomedical Research Foundation
(1978).
JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 199