Smith–Waterman algorithm exploits the property of the
substitution matrix-based scoring system that the cumula-
tive score for an alignment path decreases in regions in
which the sequences are poorly matched.Where the cumu-
lative score drops to zero, the Smith–Waterman algorithm
terminates the extension of an alignment path. Two pep-
tides may have several such local alignments.
f. Gap Penalties
If there are gaps in the alignment, one should now sub-
tract the gap penalty from the overall alignment score to
obtain the final alignment score. Since a single mutational
event can insert or delete more than one residue, a long gap
should be penalized only slightly more than a short gap.
Consequently, gap penalties have the form a bk, where a
is the penalty for opening the gap, k is the length of the gap
in residues, and b is the penalty for extending the gap by
one residue. Current statistical theory provides little guid-
ance for optimizing a and b, but empirical studies suggest
that a 8 and b 2 are appropriate values for use with
the PAM-250 matrix. Thus the final alignment score
for both alignments in Fig. 7-29d (which have both a
1-residue gap and a 2-residue gap) is 41 (8 2 1)
(8 2 2) 19.
g. Pairwise Alignments Using BLAST
The Needleman–Wunsch algorithm and later the
Smith–Waterman algorithm (in their computerized forms)
were widely used in the 1970s and 1980s to find relationships
between proteins. However, the need to compare every
newly determined sequence with the huge and rapidly grow-
ing number of sequences in publicly available databases re-
quires that this process be greatly accelerated. Modern com-
puters can do so using sequence alignment programs that
employ sophisticated heuristic algorithms (algorithms that
make educated “guesses”) but at the risk of obtaining subop-
timal results (in the case of sequence alignments, the heuris-
tic algorithms are based on knowledge of how sequences
evolve). Consequently, in what follows, we shall describe how
these programs are used rather than how they work.
The PAM-250 substitution matrix is based on an extrap-
olation: Its calculation assumes that the rate of mutation
over one PAM unit of evolutionary distance is the same as
that over 250 PAM units. This may not be the case. After
all, homologous proteins that are separated by large evolu-
tionary distances may diverge in their function and hence
their rates of evolution may change (recall that different
proteins have different rates of evolution; Fig. 7-23).To ac-
count for this possibility, and because of the huge amount
of sequence data that had become available since the PAM
matrices were calculated in the mid-1970s, a log odds sub-
stitution matrix based on ⬃2000 blocks of aligned sequences
that lacked gaps taken from ⬃500 groups of related proteins
was calculated. The substitution matrix that gives the most
sensitive performance for ungapped alignments is called
BLOSUM62 [blosum (pronounced “blossom”) for block
substitution matrix; the 62 indicates that all blocks
of aligned polypeptides in which there is 62% identity
are weighted as a single sequence in order to reduce con-
tributions from closely related sequences], whereas
BLOSUM45 appears to perform better for alignments with
gaps. Sequence alignments based on the BLOSUM62 or
BLOSUM45 matrices are more sensitive than are those
based on the PAM-250 matrix.
BLAST (for basic local alignment search tool) is the
most widely used, publicly available software package for
making pairwise sequence alignments—both for polypep-
tides and for polynucleotides. This program uses a heuristic
approach that approximates the Smith–Waterman algo-
rithm so as to obtain the optimum mix of sensitivity (the
ability to identify distantly related sequences) and selectiv-
ity (the avoidance of unrelated sequences with spuriously
high alignment scores). It pairwise aligns up to a user-
selected number of subject sequences (default 100) in the
chosen database(s) that are the most similar to the query se-
quence. BLAST, which was originated by Stephen Altschul,
is publicly available, free of charge, for interactive use over
the Web (http://www.ncbi.nlm.nih.gov/BLAST/ Blast.cgi)
on a server at the National Center for Biotechnology Infor-
mation (NCBI). Let us discuss the BLAST system in its
comparison of proteins (protein blast or blastp).
Protein databases presently contain ⬃900,000 nonredun-
dant peptide sequences. BLAST therefore minimizes the
time it spends on a sequence region whose similarity with
the query sequence has little chance of exceeding some min-
imal alignment score. Pairwise alignments (e.g., Fig. 7-31a),
which by default are found using BLOSUM62 (substitution
matrices and gap penalties can be optionally selected under
“Algorithm parameters”), are listed in order of decreasing
statistical significance and are presented in a manner that in-
dicates the positions of both the identical residues and simi-
lar residues in the query and subject sequences. The number
of identical residues, positives (those residue pairs whose ex-
change has a positive value in the substitution matrix used),
and gaps over the length of the alignment are indicated.
BLAST assesses the statistical significance of an alignment
in terms of its “E value” (E for Expect), which is the number
of alignments with at least the same score that would have
been expected to occur in the database by chance. For exam-
ple, an alignment with an E value of 5 is statistically insignif-
icant, whereas one with an E value of 0.01 is significant, and
one with an E value of 1 10
20
offers extremely high con-
fidence that the query and subject sequences are homolo-
gous. BLAST also reports a “bit score” for each alignment,
which is a type of normalized alignment score.
h. Multiple Sequence Alignments with CLUSTAL
BLAST makes only pairwise alignments. To simultane-
ously align more than two sequences, that is, to obtain a mul-
tiple sequence alignment such as Table 7-4, a different pro-
gram must used.Perhaps the most widely used such program,
ClustalW2, is publicly available for interactive use over the
Web at http://www.ebi.ac.uk/Tools/clustalw2/. The input for
the program is a file containing all the sequences (either pep-
tides or DNA) to be aligned.As with BLAST, the user can se-
lect the substitution matrix and the gap penalty parameters
that ClustalW2 uses. ClustalW2 begins by finding all possible
pairwise alignments of the input sequences. This permits the
Section 7-4. Bioinformatics: An Introduction 201
JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 201