
strand. One DNA chain can serve as a template for the assembly of a second,
complementary DNA strand, which is precisely how the genome is replicated
prior to cell division. Complementary DNA will spontaneously pair up (or
“hybridize”) at room or body temperature, and can be caused to separate or
“melt” at elevated temperature, which was later exploited in the widely used
polymerase chain reaction (PCR) method of DNA amplification in the
laboratory.
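The template relationship between complementary strands can be illustrated with a minimal Python sketch. It assumes the standard Watson-Crick pairing rules (A with T, G with C) and antiparallel strands, neither of which is restated in this passage; the function names `complement_strand` and `can_hybridize` are hypothetical and chosen only for this example.

```python
# Watson-Crick base pairing (assumed here): A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(template):
    """Return the strand complementary to `template`, written 5'->3'.

    Because paired strands are antiparallel, this is the reverse complement.
    """
    return "".join(COMPLEMENT[base] for base in reversed(template.upper()))


def can_hybridize(strand_a, strand_b):
    """True if the two strands are perfectly complementary over their full length."""
    return complement_strand(strand_a) == strand_b.upper()


template = "ATGGCC"
new_strand = complement_strand(template)
print(new_strand)                           # GGCCAT
print(can_hybridize(template, new_strand))  # True
```

Real hybridization tolerates some mismatches and depends on temperature and salt concentration; the all-or-nothing test above ignores those effects.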
So how does the genome, encoded in the sequence of DNA, become the thousands
of proteins within the cell that interact and participate in chemical reactions? It has
been determined that each gene corresponds in general to one protein. The process
by which the genes encoded in DNA are made into the myriad of proteins found in
the cell is referred to as the “central dogma” of molecular biology. Information
contained in DNA is “transcribed” into the single-stranded polynucleotide ribonucleic
acid (RNA) by RNA polymerases. The nucleotide sequences in DNA are
copied directly into RNA, except for thymine (T), which is substituted with uracil
(U, a pyrimidine) in the new RNA molecule. The RNA gene sequences are then
“translated” into the amino acid sequences of different proteins by protein/RNA
complexes known as ribosomes. Groups of three nucleotides, each called a “codon”
($N = 4^3 = 64$ combinations), encode the 20 different amino acids that are the
building blocks of proteins. Table 9.1 shows the 64 codons and the corresponding
amino acid or start/stop signal.
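The transcription and translation steps described above can be sketched in Python as follows. The codon table is built from the standard genetic code, which is assumed here to match Table 9.1; the function names `transcribe` and `translate` are hypothetical, and `transcribe` follows the simplified description in the text (copy the sequence, replacing T with U) rather than modelling the template strand explicitly.

```python
from itertools import product

# Standard genetic code (assumed to agree with Table 9.1), with codons listed
# in U, C, A, G order for each of the three positions. "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Copy a DNA coding sequence into RNA, substituting uracil for thymine."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA codon by codon (one-letter amino acid codes) up to the first stop."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))                        # 64, i.e. 4**3 codons
print(len(set(CODON_TABLE.values()) - {"*"}))  # 20 distinct amino acids
rna = transcribe("ATGGCCTGGTAA")
print(rna)                                     # AUGGCCUGGUAA
print(translate(rna))                          # MAW
```

Since 64 codons map onto only 20 amino acids plus the stop signals, most amino acids are encoded by more than one codon.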
The three stop codons, UAA, UAG, and UGA, which instruct ribosomes to
terminate the translation, have been given the names “ochre,” “amber,” and
“opal,” respectively. The AUG codon, which encodes methionine, represents the
translation start signal. Not all of the nucleotides in DNA encode proteins,
however. Large stretches of the genome, called introns, are transcribed but then
spliced out of the RNA by enzyme complexes that recognize the proper splicing signals, and
the remaining exons are joined together to form the protein-encoding portions.
Major sequence repositories are curated by the National Center for Biotechnology
Information (NCBI) in the United States, the European Molecular Biology
Laboratory (EMBL), and the DNA Data Bank of Japan.
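As an illustration of the start and stop signals described above, the following Python sketch scans an RNA sequence for the first AUG and reads codons until an in-frame stop codon is reached. It is a simplified illustration, not a gene-finding method: it assumes an already spliced RNA (intron removal is not modelled), and the function name `find_open_reading_frame` is hypothetical.

```python
START_CODON = "AUG"
# Stop codons and the names given to them in the text.
STOP_CODONS = {"UAA": "ochre", "UAG": "amber", "UGA": "opal"}


def find_open_reading_frame(rna):
    """Return (codons, stop_name) for the reading frame opened by the first AUG.

    `codons` lists the codons from the start codon up to, but excluding, the
    stop codon. `stop_name` is None if no in-frame stop codon is found.
    """
    start = rna.find(START_CODON)
    if start == -1:
        return [], None
    codons = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            return codons, STOP_CODONS[codon]
        codons.append(codon)
    return codons, None


codons, stop_name = find_open_reading_frame("GGAUGGCCUGGUAGCC")
print(codons)     # ['AUG', 'GCC', 'UGG']
print(stop_name)  # amber
```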
9.2 Sequence alignment and database searches
The simplest method for identifying regions of similarity between two sequences
is to produce a graphical dot plot. A two-dimensional graph is generated, and a
dot is placed at each position where the two compared sequences are identical.
The word size can be specified, as in MATLAB program 9.1, to reduce the
noisiness produced by many very short (length = 1–2) regions of similarity.
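The dot plot construction, including the word-size filter, can be sketched in Python as shown below. This is not the book's MATLAB program 9.1 but an independent illustration; the function names `dot_plot` and `render` are hypothetical, and a dot at (i, j) is taken to mean that the words of length `word_size` starting at position i of the first sequence and position j of the second are identical.

```python
def dot_plot(seq_a, seq_b, word_size=1):
    """Return the set of (i, j) where seq_a[i:i+w] equals seq_b[j:j+w], w = word_size."""
    # Index every word of seq_b by its starting positions.
    positions = {}
    for j in range(len(seq_b) - word_size + 1):
        positions.setdefault(seq_b[j:j + word_size], []).append(j)
    # Look up each word of seq_a in that index.
    dots = set()
    for i in range(len(seq_a) - word_size + 1):
        for j in positions.get(seq_a[i:i + word_size], []):
            dots.add((i, j))
    return dots


def render(dots, n_rows, n_cols):
    """Draw the dot plot as text: '*' marks a match, '.' marks no match."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(n_cols))
        for i in range(n_rows)
    )


seq = "GATTACAGATTACA"  # a short sequence containing one repeat, compared with itself
for w in (1, 3):
    dots = dot_plot(seq, seq, word_size=w)
    print(f"word size {w}: {len(dots)} dots")
    print(render(dots, len(seq) - w + 1, len(seq) - w + 1))
```

Increasing the word size from 1 to 3 removes most isolated dots while keeping the main diagonal and the off-diagonal lines produced by the repeated segment.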
Identity runs along the main diagonal, and common subsequences including
possible translocations are seen as shorter off-diagonal lines. Inversions, where a
region of the gene runs in the opposite direction, appear as lines perpendicular to
the main diagonal, while deletions appear as interruptions in the lines. Figure 9.1
shows the sequence of human P-selectin, an adhesion protein important in
inflammation, compared against itself. In the P-selectin secondary structure,
it is known that nine consensus repeat domains (sequences of amino acids that
occur with high frequency in a polypeptide) exist, and these can be seen as