
reported. A variation on the BLAST algorithm, which includes the possibility of two
aligned sequences connected with a gap, is called BLAST2 or Gapped BLAST. It
finds two non-overlapping hits with a score of at least T and a maximum distance d
from one another. An ungapped extension is performed, and if the generated
highest-scoring segment pairs (HSPs) have sufficiently high scores when normalized
appropriately by the segment length, then a gapped extension is initiated. Results are
reported for alignments showing a sufficiently low E-value. The E-value is a statistic
which gives a measure of whether potential matches might have arisen by chance. An
E-value of 10 signifies that ten matches scoring at least as well would be expected to
occur by chance in a search of a database of the same size. The value 10 is often
used as a default cutoff, and E < 0.05 is generally taken as being
significant. Generally speaking, long regions of moderate similarity are more
significant than short regions of high identity. Some of the specific BLAST programs
available at the NIH National Center for Biotechnology Information (NCBI,
http://blast.ncbi.nlm.nih.gov/Blast.cgi) include: blastn, for submitting nucleotide queries
to a nucleotide database; blastp, for protein queries to a protein database; blastx,
for searching protein databases using a translated nucleotide query; tblastn, for
searching translated nucleotide databases using a protein query; and tblastx, for
searching translated nucleotide databases using a translated nucleotide query.
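
The two-hit seeding step described above can be illustrated with a minimal Python
sketch. It is not the NCBI implementation. It assumes that word hits scoring at least
T have already been found and are supplied as (query position, subject position)
pairs, and that the two hits must lie on the same diagonal, as in the usual
description of the two-hit method. The names two_hit_seeds, word_len and max_dist
are hypothetical; max_dist plays the role of the distance d in the text.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_len, max_dist):
    """Return pairs of word hits that could trigger an ungapped extension.

    hits     -- iterable of (query_pos, subject_pos) word hits, each assumed
                to have already scored at least T
    word_len -- length of a word; two hits closer than this overlap
    max_dist -- maximum allowed distance between the two hits (d in the text)
    """
    # Assumption: both hits must lie on the same diagonal, i.e. share the
    # same value of subject_pos - query_pos.
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diag, q_positions in by_diagonal.items():
        last = None  # most recent hit remembered on this diagonal
        for q_pos in sorted(q_positions):
            if last is not None:
                gap = q_pos - last
                if gap < word_len:
                    # Overlaps the remembered hit, so it is ignored.
                    continue
                if gap <= max_dist:
                    seeds.append(((last, last + diag), (q_pos, q_pos + diag)))
            last = q_pos
    return seeds


if __name__ == "__main__":
    example_hits = [(0, 10), (2, 12), (4, 14), (50, 60), (7, 3)]
    print(two_hit_seeds(example_hits, word_len=3, max_dist=40))
```

In a full search, each returned pair would then be passed to the ungapped extension
step, and only sufficiently high-scoring segment pairs would go on to a gapped
extension.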
Further variations on the original BLAST algorithm are available at the NCBI
Blast query page, and can be selected as options. PSI-BLAST, or Position-Specific
Iterated BLAST, allows the user to build a position-specific scoring matrix using the
results of an initial (default option) blastp query. PHI-BLAST, or Pattern Hit
Initiated BLAST, finds proteins which contain a user-specified pattern and show
similarity to the query within the region of the pattern, and is integrated with
PSI-BLAST (Altschul et al., 1997).
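
To make the idea of a position-specific scoring matrix concrete, the following
Python sketch builds one from a small set of aligned, gap-free protein segments and
uses it to score a candidate segment. It is a simplified illustration only: it
assumes a uniform background frequency for the 20 amino acids and a fixed
pseudocount, whereas PSI-BLAST itself weights the sequences and estimates
pseudocounts in a more elaborate way. The function names build_pssm and
score_segment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from aligned segments.

    aligned     -- list of equal-length, gap-free protein segments
    pseudocount -- count added to every residue so unseen residues do not
                   receive a score of minus infinity
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned segments must have the same length")

    # Assumption: uniform background frequency for every amino acid.
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)

    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Positive score: residue is more common here than background.
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment of matching length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["ACD", "ACE", "ACD"])
    for candidate in ("ACD", "ACE", "WWW"):
        print(candidate, round(score_segment(pssm, candidate), 2))
```

Unlike a fixed substitution matrix, the score of a residue here depends on the column
in which it occurs, which is what allows an iterated search to pick up more distant
relatives of the query.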
There are many situations where it is desirable to perform sequence comparison
between three or more proteins or nucleic acids. Such comparisons can help to identify
functionally important sites, predict protein structure, or even start to reconstruct the
evolutionary history of the gene. A number of algorithms for achieving multiple
sequence alignment are available, falling into the categories of dynamic programming,
progressive alignment, or iterative search methods. Dynamic programming methods
proceed as sketched out above. The classic Needleman–Wunsch method is followed
with the difference that a higher-dimensional array is used in place of the two-
dimensional matrix. The number of comparisons increases exponentially with the
number of sequences; in practice this is dealt with by constraining the number of letters
or words that must be explicitly examined (Carillo and Lipman, 1988). In the
progressive alignment method, we start by obtaining an optimal pairwise alignment
between the two most similar sequences among the query group. New, less related
sequences are then added one at a time to the first pairwise alignment. There is no
guarantee following this method that the optimal alignment will be found, and the
process tends to be very sensitive to the initial alignments. The ClustalW (“Weighted”
Clustal; freely available at www.ebi.ac.uk/clustalw or align.genome.jp among other
sites) algorithm is a general purpose multiple sequence alignment program for DNA
or proteins. It performs pairwise alignments of all submitted sequences (maximum
sequence no. = 30; maximum length = 10 000) and then produces a phylogenetic
tree (see Section 9.3) for comparing evolutionary relationships (Thompson et al.,
1994). Sequences are submitted in FASTA format, and the newer versions of
ClustalW can produce graphical output of the results. T-Coffee
(www.ebi.ac.uk/t-coffee or www.tcoffee.org) is another multiple sequence alignment
program, a progressive method which combines information from both local and global
alignments (Notredame et al., 2000). This helps to minimize the sensitivity to the first
9.2 Sequence alignment and database searches