Voet D., Voet Ju.G. Biochemistry

Подождите немного. Документ загружается.

Smith–Waterman algorithm exploits the property of the

substitution matrix-based scoring system that the cumula-

tive score for an alignment path decreases in regions in

which the sequences are poorly matched.Where the cumu-

lative score drops to zero, the Smith–Waterman algorithm

terminates the extension of an alignment path. Two pep-

tides may have several such local alignments.

f. Gap Penalties

If there are gaps in the alignment, one should now sub-

tract the gap penalty from the overall alignment score to

obtain the final alignment score. Since a single mutational

event can insert or delete more than one residue, a long gap

should be penalized only slightly more than a short gap.

Consequently, gap penalties have the form a  bk, where a

is the penalty for opening the gap, k is the length of the gap

in residues, and b is the penalty for extending the gap by

one residue. Current statistical theory provides little guid-

ance for optimizing a and b, but empirical studies suggest

that a 8 and b 2 are appropriate values for use with

the PAM-250 matrix. Thus the final alignment score

for both alignments in Fig. 7-29d (which have both a

1-residue gap and a 2-residue gap) is 41  (8  2  1) 

(8  2  2)  19.

g. Pairwise Alignments Using BLAST

The Needleman–Wunsch algorithm and later the

Smith–Waterman algorithm (in their computerized forms)

were widely used in the 1970s and 1980s to find relationships

between proteins. However, the need to compare every

newly determined sequence with the huge and rapidly grow-

ing number of sequences in publicly available databases re-

quires that this process be greatly accelerated. Modern com-

puters can do so using sequence alignment programs that

employ sophisticated heuristic algorithms (algorithms that

make educated “guesses”) but at the risk of obtaining subop-

timal results (in the case of sequence alignments, the heuris-

tic algorithms are based on knowledge of how sequences

evolve). Consequently, in what follows, we shall describe how

these programs are used rather than how they work.

The PAM-250 substitution matrix is based on an extrap-

olation: Its calculation assumes that the rate of mutation

over one PAM unit of evolutionary distance is the same as

that over 250 PAM units. This may not be the case. After

all, homologous proteins that are separated by large evolu-

tionary distances may diverge in their function and hence

their rates of evolution may change (recall that different

proteins have different rates of evolution; Fig. 7-23).To ac-

count for this possibility, and because of the huge amount

of sequence data that had become available since the PAM

matrices were calculated in the mid-1970s, a log odds sub-

stitution matrix based on ⬃2000 blocks of aligned sequences

that lacked gaps taken from ⬃500 groups of related proteins

was calculated. The substitution matrix that gives the most

sensitive performance for ungapped alignments is called

BLOSUM62 [blosum (pronounced “blossom”) for block

substitution matrix; the 62 indicates that all blocks

of aligned polypeptides in which there is 62% identity

are weighted as a single sequence in order to reduce con-

tributions from closely related sequences], whereas

BLOSUM45 appears to perform better for alignments with

gaps. Sequence alignments based on the BLOSUM62 or

BLOSUM45 matrices are more sensitive than are those

based on the PAM-250 matrix.

BLAST (for basic local alignment search tool) is the

most widely used, publicly available software package for

making pairwise sequence alignments—both for polypep-

tides and for polynucleotides. This program uses a heuristic

approach that approximates the Smith–Waterman algo-

rithm so as to obtain the optimum mix of sensitivity (the

ability to identify distantly related sequences) and selectiv-

ity (the avoidance of unrelated sequences with spuriously

high alignment scores). It pairwise aligns up to a user-

selected number of subject sequences (default 100) in the

chosen database(s) that are the most similar to the query se-

quence. BLAST, which was originated by Stephen Altschul,

is publicly available, free of charge, for interactive use over

the Web (http://www.ncbi.nlm.nih.gov/BLAST/ Blast.cgi)

on a server at the National Center for Biotechnology Infor-

mation (NCBI). Let us discuss the BLAST system in its

comparison of proteins (protein blast or blastp).

Protein databases presently contain ⬃900,000 nonredun-

dant peptide sequences. BLAST therefore minimizes the

time it spends on a sequence region whose similarity with

the query sequence has little chance of exceeding some min-

imal alignment score. Pairwise alignments (e.g., Fig. 7-31a),

which by default are found using BLOSUM62 (substitution

matrices and gap penalties can be optionally selected under

“Algorithm parameters”), are listed in order of decreasing

statistical significance and are presented in a manner that in-

dicates the positions of both the identical residues and simi-

lar residues in the query and subject sequences. The number

of identical residues, positives (those residue pairs whose ex-

change has a positive value in the substitution matrix used),

and gaps over the length of the alignment are indicated.

BLAST assesses the statistical significance of an alignment

in terms of its “E value” (E for Expect), which is the number

of alignments with at least the same score that would have

been expected to occur in the database by chance. For exam-

ple, an alignment with an E value of 5 is statistically insignif-

icant, whereas one with an E value of 0.01 is significant, and

one with an E value of 1  10

20

offers extremely high con-

fidence that the query and subject sequences are homolo-

gous. BLAST also reports a “bit score” for each alignment,

which is a type of normalized alignment score.

h. Multiple Sequence Alignments with CLUSTAL

BLAST makes only pairwise alignments. To simultane-

ously align more than two sequences, that is, to obtain a mul-

tiple sequence alignment such as Table 7-4, a different pro-

gram must used.Perhaps the most widely used such program,

ClustalW2, is publicly available for interactive use over the

Web at http://www.ebi.ac.uk/Tools/clustalw2/. The input for

the program is a file containing all the sequences (either pep-

tides or DNA) to be aligned.As with BLAST, the user can se-

lect the substitution matrix and the gap penalty parameters

that ClustalW2 uses. ClustalW2 begins by finding all possible

pairwise alignments of the input sequences. This permits the

Section 7-4. Bioinformatics: An Introduction 201

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 201

program to determine the relationships of the input se-

quences with one another based on their similarity scores

and hence generate a crude phylogenetic tree called a den-

drogram. Then, starting with the highest scoring pairwise

alignment, it sequentially carries out realignments on the ba-

sis of the remaining sequences, which it adds in order of their

decreasing relationships with the previously added se-

quences, while opening up gaps as necessary. The output of

ClustalW2 is the aligned sequences (e.g., Fig. 7-31b).

Since multiple sequence alignment programs are easily

confounded by such anomalies as sequences that are not

homologous or that contain homologous segments in dif-

ferent orders, multiple sequence alignments should be

carefully inspected to determine if they are sensible and, if

necessary, corrected and trimmed by hand. Indeed, in Fig.

7-31a, the alignment of the first 21 residues of the query se-

quence (P04168) with the subject sequence (P38524) dif-

fers from that in Fig. 7-30b.

i. The Use of Profiles Extends the Sensitivities of

Sequence Alignments

Multiple sequence alignments can be used to improve

the sensitivity of similarity searches, that is, to detect weak

but significant sequence similarities. For example, in pair-

wise alignments, peptide A may appear to be similar to pep-

tide B and peptide B to be similar to peptide C but peptides

A and C may not appear to be similar. However, a multiple

sequence alignment of peptides A, B, and C will reveal the

similarities between peptides A and C. This idea has been

extended through the construction of profiles (also called

position-specific score matrices) that take into account the

fact that some residues in a given protein are structurally

and/or functionally more important than others and there-

fore are less subject to evolutionary change. Hence, at each

residue position in a multiple sequence alignment, highly

conserved residues are assigned a large positive score,

weakly conserved positions are assigned a score near zero,

and unconserved residues are assigned a large negative

score. Many profile-generating algorithms are based on sta-

tistical models called hidden Markov models (HMMs). The

use of such conservation patterns has been successful in

finding sequences that are so distantly related to a query se-

quence (so far into the twilight zone) that BLAST would

not consider them to have significant sequence similarity.

The program PSI-BLAST (for Position-Specific Iter-

ated BLAST), which is also available for use at

http://www.ncbi.nlm.nih.gov/blast/Blast.cgi, uses the re-

sults of a BLAST search with a query sequence to generate

202 Chapter 7. Covalent Structures of Proteins and Nucleic Acids

Figure 7-31 Examples of peptide sequence alignments. (a) A

BLAST pairwise alignment.The aligned proteins are high

potential iron–sulfur proteins (HiPIPs), small bacterial proteins

whose sequences are archived in the SWISS-PROT database.

The amino acid residues are indicated by their one-letter codes

(Table 4-1) and gaps are indicated by dashes. The query sequence

is HiPIP Isozyme 1 from Halorhodospira halophila

(HIP1_HALHA, SWISS-PROT accession number P04168;

isozymes are catalytically and structurally similar but genetically

distinct enzymes from the same organism) and the subject

sequence is HiPIP Isozyme 2 from Ectothiorhodospira vacuolata

(HPI2_ECTVA, SWISS-PROT accession number P38524).The

first two lines (green) identify the subject sequence and indicates

its length in residues.This is followed by an assortment of

(a) BLAST pairwise alignment

>sp P38524 HIP2_ECTVA HIGH POTENTIAL IRON-SULFUR PROTEIN, ISOZYME 2 (HIPIP 2)

Length  71

Score  44.3 bits (103), Expect  1e-04 Method: Compositional matrix adjust.

Identities  27/70 (38%), Positives  35/70 (50%), Gaps  4/70 (5%)

Query: 1

Sbjct: 2



–



–



(b) Clustal W2 Multiple Sequence Alignment

alignment statistics (black).The query and subject sequences are

then shown vertically aligned (blue) with the line between them

(black) indicating residues that are identical (by their one-letter

codes) and similar [by a plus ()].The output of BLAST consists

of a series of such pairwise alignments. (b) A ClustalW2 multiple

sequence alignment of five HiPIP sequences: the two foregoing

sequences, their corresponding Isozymes 1 and 2 (P38941 and

P04169), and HiPIP from Rhodocyclus gelatinosus (P00265).

Confidently aligned residues are shaded according to residue

type. The bar graph below the alignment is indicative of the

alignment quality based on the BLOSUM62 score of the

observed substitutions. Note that the alignment of the residues

numbered 9 through 29 of P04168 (the query sequence in Part a)

with P38524 differs from that in Part a.

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 202

a profile and then employs the profile to search for new

alignments. This is an iterative process in which the profile

generated after each alignment search is used to make a

new alignment search, etc., until no further significant

alignments are found. For instance, for the query sequence

used in Fig. 7-31a (HIP1_HALHA; SWISS-PROT acces-

sion number P04168), BLAST finds only five sequences

(called hits) that have E values less than 0.001 in the

SWISS-PROT database (which includes a self-alignment).

In contrast, PSI-BLAST finds 16 such hits after three itera-

tions (and no additional hits in the fourth iteration, where-

upon the search is said to have converged).Thus the use of

profile analysis makes possible the detection of subtle but

significant sequence relationships that, as we shall see in

succeeding chapters, lead to considerable evolutionary and

functional insights.

j. Structural Genes Should Be Aligned

as Polypeptides

In many cases, only the base sequence of the DNA en-

coding a protein is known. Indeed, most of the known pro-

tein sequences have been inferred from DNA sequences.

Although BLAST and ClustalW2 are both capable of

aligning nucleic acid sequences (and are routinely used to

do so), one should compare the inferred amino acid se-

quences of structural genes rather than just their base se-

quences. This is because amino acid sequence comparisons

can routinely identify sequences that shared a common an-

cestor over 1 billion years ago (e.g., those of cytochrome c

and histone H4; Fig. 7-23), whereas it is rare to detect ho-

mologies in noncoding DNA sequences that diverged

200 million years (Myr) ago and in coding sequences that

diverged 600 Myr ago.The reasons for this are threefold:

1. DNA has only four different bases, whereas peptides

consist of 20 different amino acid residues. Consequently, it

is much easier to find spurious alignments with DNA, at

least for short segments, than with peptides (a dot plot of

two unrelated DNAs has, on average, 25% of its spaces

filled vs 5% for unrelated polypeptides).

2. DNA evolves much more quickly than proteins. In

the coding regions of structural genes, 24% of single base

changes encode the same amino acid.There are few evolu-

tionary constraints to maintain the sequence identity of

these bases or of the gene’s noncoding regions (e.g., those

containing introns). Consequently, the evolutionary con-

straints on proteins are more stringent than those on DNA.

3. Direct DNA sequence alignments do not use amino

acid substitution matrices such as PAM-250 and BLO-

SUM62 and, hence, are not constrained by the evolution-

ary information implicit in these matrices (although there

are analogous 4  4 matrices for base substitutions).

If the base sequence of a structural gene is known, its

putative control regions, particularly its start and stop

codons, can usually be identified. This, in turn, reveals

which of two complementary DNA strands is the so-called

sense strand (which has the same sequence as the mRNA

transcribed from the DNA) and indicates its correct read-

ing frame. Even if it is unclear that a DNA segment that is

flanked by what appear to be start and stop codons actually

encodes a protein, one can compare the amino acid se-

quences from all six possible reading frames (three each

from the DNA’s two complementary strands). In fact,

BLAST does so automatically when aligning peptide se-

quences based on DNA sequences.

C. Construction of Phylogenetic Trees

Phylogenetic trees were first made by Linnaeus, the eigh-

teenth century biologist who originated the system of tax-

onomy (biological classification) still in use today. These

trees (e.g., Fig. 1-4) were originally based on morphological

characteristics, whose measurements were largely subjec-

tive. It was not until the advent of sequence analysis, how-

ever, that the generation of phylogenetic trees was put on a

firm quantitative basis (e.g., Fig. 7-22). In the following

paragraphs we discuss the characteristics of phylogenetic

trees and outline how they are generated.

Figure 7-32a is a phylogenetic tree diagramming the

evolutionary relationships between four homologous

genes, A, B, C, and D.The tree consists of four leaves or ex-

ternal nodes, each representing one of these genes, and two

branch points or internal nodes, which represent ancestral

genes, with the length of each branch indicating the degree

of difference between the two nodes they connect. All

branch points are binary, that is, one gene is considered to

Section 7-4. Bioinformatics: An Introduction 203

Figure 7-32 Phylogenetic trees. (a) An unrooted tree with four

leaves (A, B, C, and D) and two branch points. (b) The five

rooted trees that can be generated from the unrooted tree in

Part a.The roots are drawn and numbered in red.

Branches

Leaves

Branch

points

(a)

(b)

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 203

give rise to only two descendents at a time so that branches

can only bifurcate (although branch points may turn out to

be so close together that their order cannot be determined;

e.g., the root of Fig. 7-22). Note that this is an unrooted tree,

that is, it indicates the relationships between the four genes

but provides no information as to the evolutionary events

through which they arose. The five different evolutionary

pathways that are possible for our unrooted tree are dia-

grammed in Fig. 7-32b as different rooted trees, where the

node at which the root joins the tree represents the four

genes’ last common ancestor. With knowledge of only the

A, B, C, and D genes, phylogenetic analysis cannot distin-

guish between these rooted trees. In order to find a tree’s

root, it is necessary to obtain the sequence of an outgroup,

a homologous gene that is less closely related to the genes

in the tree than they are to each other.This permits the root

of the tree to be identified and hence the pathway through

which the genes evolved to be elucidated.

The number of different bifurcating trees with the same n

leaves increases extremely rapidly with n (e.g., for n  10,it is

over 2 million). Unfortunately, there is no exact method for

generating an optimal phylogenetic tree. Indeed, there is no

general agreement on what constitutes an optimal tree. Con-

sequently, numerous methods have been formulated for con-

structing phylogenetic trees based on sequence alignments.

In one class of methods for constructing phylogenetic

trees, the sequence data are converted into a distance ma-

trix, which is a table showing the evolutionary distances be-

tween all pairs of genes in the data set (e.g.,Table 7-5). Evo-

lutionary distances are the number of sequence differences

between two genes (ideally corrected for the possibility of

multiple mutations at a given site). These quantities are

used to calculate the lengths of the branches in a tree under

the assumption that they are additive, that is, the distance

between any pair of leaves is the sum of the lengths of the

branches connecting them.

Perhaps the conceptually simplest (if it can be called

that) way of generating a phylogenetic tree is the neighbor-

joining (N-J) method, in which it is initially assumed that

there is only one internal node, Y, and hence that all N

leaves radiate from it in a starlike pattern (Fig. 7-33a). The

branch lengths in the star are then calculated according to

such relationships as d

 d

 d

(where d

is the to-

tal length of the branches connecting leaves A and B, etc.),

 d

 d

, and d

 d

 d

, so that, for instance,

 d

 d

). A pair of leaves is then trans-

ferred from the star to a new internal node, X, that is

connected to the center of the star by a new branch, XY

(Fig. 7-33b), and the sum of all the branch lengths, S

,in

this revised tree is calculated:

[7.6]

where

[7.7]Q 

i1

i1

j1





[2Q  R

 R

]

2(N  2)

 d

 d



kAB



(that is, the sum of all the off-diagonal elements in the

unique half of the distance matrix),

[7.8]

(that is, the sum of the elements in the Ath row of the dis-

tance matrix), and

[7.9]

The two leaves are then returned to their initial positions,

replaced by a second pair of leaves, and the total branch

length again calculated. The process is repeated until all

possible N(N  1)/2 pairs of leaves have been so treated.

The pair yielding the smallest value of S

(the shortest total

branch length) in this process, which will be nearest neigh-

bors in the final tree, are combined into a single unit of

their average length, yielding a star with one less branch. If

leaves A and B are chosen as neighbors, then the lengths of

the branches connecting them are estimated to be

[7.10]

[7.11]

and

[7.12]d



(N  1)(R

 R

)  2Q  (N

 3N  2)d

2(N  2)(N  3)

 d

 d





 R

2(N  2)



i1



i1

204 Chapter 7. Covalent Structures of Proteins and Nucleic Acids

Figure 7-33 Manipulations employed in the neighbor-joining

method for the construction of a phylogenetic tree. (a) The

starting configuration. (b) The transfer of leaves A and B to a

new branch point that is connected to the central star (red).

(a)

(b)

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 204

Assuming that S

has the lowest value of all the S

a new distance matrix is calculated whose elements,

, are the same as the d

with the exception that

where is the dis-

tance between the averaged leaves A and B and leaf i.The

entire process is then iterated so as to find all pairs of nearest-

neighbor sequences, thereby generating a phylogenetic

tree. Figure 7-34 is an unrooted phylogenetic tree that

Clustal generated from the multiple sequence alignment

shown in Fig. 7-31b using the N-J method.

The N-J method is representative of distance-based

tree-building procedures.There are two other types of tree-

building criteria that are in widespread use:

1. Maximum parsimony (MP), which is based on the

principle of Occam’s razor: The best explanation of the

data is the simplest.Thus, MP-based methods assume (per-

haps inaccurately) that evolution occurs via the fewest pos-

sible genetic changes and hence that the best phylogenetic

tree requires the smallest number of sequence changes to

account for a multiple sequence alignment.

2. Maximum likelihood (ML), which finds the tree and

branch lengths that have the highest probability of yielding

the observed multiple sequence alignment.This, in turn, re-

quires an evolutionary model that indicates the probability

of occurrence for each type of residue change (e.g., the

PAM substitution matrices).

Since the number of possible trees increases very rap-

idly with the number of leaves, the construction of a phylo-

genetic tree is a computer-intensive task for even relatively

small sets of aligned sequences (e.g., N  20, although dis-

tance-based methods require far fewer computations than

do MP- or ML-based methods).And, because of the ambi-

guities inherent in all known tree-building procedures, sta-

tistical tests have been developed to check the validity of

any particular tree.

d¿

AB,i

d¿

AB,i

 d¿

i,AB

 (d

 d

)>2,

d¿

5 CHEMICAL SYNTHESIS

OF POLYPEPTIDES

In this section we describe methods for the chemical syn-

thesis of polypeptides from amino acids. The ability to

manufacture polypeptides not available in nature has con-

siderable biomedical potential:

1. To investigate the properties of polypeptides by sys-

tematically varying their side chains.

2. To obtain polypeptides with unique properties, par-

ticularly those with nonstandard side chains or with iso-

topic labels incorporated in specific residues (neither of

which is easily accomplished using biological methods).

3. To manufacture pharmacologically active polypep-

tides that are biologically scarce or nonexistent.

One of the most promising applications of polypeptide

synthesis is the production of synthetic vaccines. Vaccines,

which have consisted of viruses that have been “killed” (in-

activated) or attenuated (“live” but mutated so as not to

cause disease in humans), stimulate the immune system to

synthesize antibodies specifically directed against these

viruses, thereby conferring immunity to them (the immune

response is discussed in Section 35-2A). The use of such

vaccines, however, is not without risk; attenuated viruses,

for example, may mutate to a virulent form and “killed”

virus vaccines have, on several occasions, caused disease

because they contained “live” viruses. Moreover, it is diffi-

cult to culture many viruses and therefore to obtain suffi-

cient material for vaccine production. Such problems

would be eliminated by preparing vaccines from synthetic

polypeptides that have the amino acid sequences of viral

epitopes (antigenic determinants; molecular groupings that

stimulate the immune system to manufacture antibodies

against them). Indeed, several such synthetic vaccines are

already in general use.

The first polypeptides to be chemically synthesized

were composed of only one type of amino acid and are

therefore known as homopolypeptides. Such compounds

as polyglycine, polyserine, and polylysine are easily synthe-

sized according to classic methods of polymer chemistry.

They have served as valuable model compounds in study-

ing the physicochemical properties of polypeptides, such as

conformational behavior and interactions with the aque-

ous environment.

The first chemical synthesis of a biologically active

polypeptide was that of the nonapeptide (9-amino acid

residue) hormone oxytocin by Vincent du Vigneaud in

1953:

Improvements in polypeptide synthesis methodology since

then have led to the synthesis of numerous biologically ac-

tive polypeptides and several proteins.

Cys Tyr Ile Gln GlyAsn

Oxytocin

Cys Pro

Leu

Section 7-5. Chemical Synthesis of Polypeptides 205

Figure 7-34 An unrooted phylogenetic tree of the five HiPIP

sequences that are aligned in Fig. 7-31b. The tree was generated

by Clustal using the neighbor-joining method.The numbers

indicate the relative lengths of the associated branches.

0.153

0.310

0.018

0.176

0.133

0.147

0.176

sp|P04169|HIP2_HALH

sp|P04168|HIP1_HALH

sp|P00265|HPI_RHOGE

sp|P38941|HIP1_ECTVA

sp|P38524|HIP2_ECTVA

0.197

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 205

A. Synthetic Procedures

Polypeptides are chemically synthesized by covalently

linking (coupling) amino acids, one at a time, to the termi-

nus of a growing polypeptide chain. Imagine that a

polypeptide is being synthesized from its C-terminus to-

ward its N-terminus; that is, the growing chain ends with a

free amino group.Then each amino acid being added to the

chain must already have its own -amino group chemically

protected (blocked) or it would react with other like mole-

cules as well as with the N-terminal amino group of the

chain. Once the new amino acid is coupled, its now N-

terminal amino group must be deprotected (deblocked) so

that the next peptide bond can be formed. Every cycle of

amino acid addition therefore requires a coupling step and a

deblocking step. Furthermore, reactive side chains must be

blocked to prevent their participation in the coupling reac-

tions, and then deblocked in the final step of the synthesis.

The reactions that were originally developed for synthe-

sizing polypeptides such as oxytocin take place entirely in

solution.The losses that are incurred on isolation and purifi-

cation of the reaction product in each of the many steps,

however, contribute significantly to the low yields of final

polypeptide.This difficulty was ingeniously circumvented in

1962 by Bruce Merrifield, through his development of solid

phase peptide synthesis (SPPS). In SPPS, a growing

polypeptide chain is covalently anchored, usually by its C-

terminus, to an insoluble solid support such as beads of poly-

styrene resin, and the appropriately blocked amino acids and

reagents are added in the proper sequence (Fig. 7-35). This

permits the quantitative recovery and purification of inter-

mediate products by simply filtering and washing the beads.

When polypeptide chains are synthesized by amino acid

addition to their N-terminus (the opposite direction to that

in protein biosynthesis; Section 5-4Ba), the -amino group

of each sequentially added amino acid must be chemically

protected during the coupling reaction. The tert-butyloxy-

carbonyl (Boc) group is frequently used for this purpose,

t-Butyloxycarbonyl

chloride

-Amino acid

Cl CH



(CH

)

C H

Boc-amino acid

NH C



(CH

)

C CH

HCl

206 Chapter 7. Covalent Structures of Proteins and Nucleic Acids

Figure 7-35 Flow diagram for polypeptide synthesis

by the solid phase method. The symbol M

represents the

ith amino acid residue to be added to the polypeptide, S

is its side chain protecting group, and Y represents the

main chain protecting group. The specific reactions are

discussed in the text. [After Erikson, B.W. and Merrifield,

R.B., in Neurath, H. and Hill, R.L. (Eds.), The Proteins

(3rd ed.),Vol. 2, p. 259, Academic Press (1979).]

Resin

...

Resin

Y M

...

Resin

...

One cycle

n–1 Cycles

Coupling to

support

Main chain

deblocking

Amino acid

coupling

(peptide bond

formation)

Cleavage and

side chain

deblocking

+ YnS

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 206

as is the 9-fluorenylmethoxycarbonyl (Fmoc) group:

Both groups undergo analogous reactions, although, in

what follows, we shall only discuss the Boc group.

a. Anchoring the Chain to the Inert Support

The first step in SPPS is the coupling of the C-terminal

amino acid to a solid support. The most commonly used

support is a cross-linked polystyrene resin with pendant

chloromethyl groups. Resin coupling occurs through the

following reaction:

and the resulting -amino acid–derivatized resin is filtered

and washed. The amino group is then deblocked by treat-

ment with an anhydrous acid such as trifluoroacetic acid,

which leaves intact the alkylbenzyl ester bond to the sup-

port resin:

Boc-amino acid

Isobutylene

O O

NH C O



Resin

COOH in CH

(CH

)

C CH

Resin

(CH

)

CCH

Boc-amino acid

Inert

support

NH C 



ClCH

NHCl





Resin

(CH

)

C CH

O R

NH C O

Resin

(CH

)

C CH

9-Fluorenylmethoxycarbonyl

(Fmoc) group

b. Coupling the Amino Acids

The reaction coupling two amino acids through a

peptide bond is endergonic and therefore must be

activated to obtain significant yields. Carbodiimides

such as dicyclohexylcarbodiimide

(DCCD) are commonly used coupling agents:

The O-acylurea intermediate that results from the reaction

of DCCD with the carboxyl group of a Boc-protected -

amino acid readily reacts with the resin-bound -amino

(CH

)

COC

NH CH

O-Acylurea intermediate

N,N'-Dicyclohexylurea

(CH

)

CCH

Resin

Dipeptidyl-resin

NH C

Boc-amino acid

Dicyclohexyl-

carbodiimide

(DCCD)



N CH

Resin

Resin-bound

amino acid

NH C

(CH

)

C CH

(R¬N“C“N¬R¿)

Section 7-5. Chemical Synthesis of Polypeptides 207

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 207

acid to form the desired peptide bond in high yield. By sub-

sequently alternating the deblocking and coupling reac-

tions, a polypeptide with the desired amino acid sequence

can be synthesized. The repetitive nature of these operations

allows the SPPS method to be easily automated.

During the course of peptide synthesis, many of the side

chains also require protection to prevent their reaction

with the coupling agent.Although there are many different

blocking groups, the benzyl group is the most widely used

(Fig. 7-36).

c. Releasing the Polypeptide from the Resin

The final step in SPPS is the cleavage of the polypeptide

from the solid support. The benzyl ester link from the

polypeptide’s C-terminus to the support resin may be

cleaved by treatment with liquid HF:

The Boc group linked to the polypeptide’s N-terminus as

well as the benzyl groups protecting its side chains are also

removed by this treatment.

B. Problems and Prospects

The steps just outlined seem simple enough, but they are

not as straightforward as we have implied. A major diffi-

culty with the entire procedure is its low cumulative yield.

Let us examine the reasons for this. To synthesize a

polypeptide chain with n peptide bonds requires at least 2n

reaction steps—one for coupling and one for deblocking

each residue. If a protein-sized polypeptide is to be synthe-

sized in reasonable yield, then each reaction step must be

essentially quantitative; anything less greatly reduces the

yield of final product. For example, in the synthesis of a

101-residue polypeptide chain, in which each reaction step

occurs with an admirable 98% yield through 200 reaction

steps, the overall yield is only 0.98

200

 100  2%. There-

fore, although oligopeptides can be routinely made, the

synthesis of large polypeptides requires almost fanatical at-

tention to chemical detail.

An ancillary problem is that the newly liberated syn-

thetic polypeptide must be purified.This may be a difficult

O R

(CH

)

C Resin

(CH

)

COC CH

C



C N

O R

...

liquid HF

O R

Resin

FCH

C N



...

task because a significant level of incomplete reactions

and/or side reactions at every stage of SPPS will result in

almost a continuum of closely related products for large

polypeptides. The use of reverse-phase HPLC techniques

(Section 6-3Dh), however, greatly facilitates this purifica-

tion process, and the quality of both intermediate and final

products can be readily assessed through mass spectromet-

ric techniques (Section 7-1I).

Using automated SPPS, Merrifield synthesized the nona-

peptide hormone bradykinin in 85% yield:

Bradykinin

However, it was only in 1988, through steady progress in

improving reaction yields (to 99.5% on average) and

eliminating side reactions, that it became possible to syn-

thesize ⬃100-residue polypeptides of reasonable quality.

Thus, Stephen Kent synthesized the 99-residue HIV-1

protease [an enzyme that is essential for the maturation of

Arg¬Pro¬Pro¬Gly¬Phe¬Ser¬Pro¬Phe¬Arg

208 Chapter 7. Covalent Structures of Proteins and Nucleic Acids

Figure 7-36 A selection of amino acids with benzyl-protected

side chains and a Boc-protected ␣-amino group. These substances

can be used directly in the coupling reactions forming peptide

bonds.

(CH

)

(CH

)

()

CH COOHNH

(CH

)

(CH

)

CH COOHNH

Boc, N

-benzyloxycarbonyl-Lys

Boc, S-benzyl-Cys

Boc-Glu, γ -Benzyl ester

Boc, O-benzyl-Ser

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 208

human immunodeficiency virus-1 (HIV-1, an AIDS virus;

Section 15-4C)] in such high yield and purity that, after

being renatured (folded to its native conformation; Sec-

tion 9-1A), it exhibited full biological activity. Indeed, this

synthetic protein was crystallized and its X-ray structure

was shown to be identical to that of biologically synthe-

sized HIV-1 protease. Kent also synthesized HIV-1 pro-

tease from

D-amino acids and experimentally verified, for

the first time, that such a protein has the opposite chirality

of its biologically produced counterpart. Moreover, this

D-amino acid protease catalyzes the cleavage of its target

polypeptide made from

D-amino acids but not that made

from

L-amino acids as does naturally occurring HIV-1

protease.

Despite the foregoing, the accumulation of resin-bound

side products limits the sizes of the polypeptides that can rou-

tinely be synthesized by SPPS to ⬃60 residues. Kent has par-

tially circumvented this limitation through the development

of the so-called native chemical ligation reaction, which links

together two polypeptides in a peptide bond to routinely

yield polypeptides as large as ⬃120 residues (Fig. 7-37).

Moreover, several peptide segments can be consecutively

linked by native chemical ligation, so that the chemical syn-

thesis of polypeptides consisting of several hundred

residues can be anticipated. In fact, using this technique,

Kent synthesized the 203-residue “covalent dimer” of HIV-1

protease (Section 15-4C) from four synthetic peptides—

the largest linear polypeptide yet synthesized—and

showed that it was fully enzymatically active.

6 CHEMICAL SYNTHESIS

OF OLIGONUCLEOTIDES

Molecular cloning techniques (Section 5-5) have permit-

ted the genetic manipulation of organisms in order to in-

vestigate their cellular machinery, change their charac-

teristics, and produce scarce or specifically altered

proteins in large quantities. The ability to chemically syn-

thesize DNA oligonucleotides of specified base sequences

is an indispensable part of this powerful technology. Thus,

as we have seen, specific oligonucleotides are required as

probes in Southern blotting (Section 5-5D) and in situ

hybridization (Section 5-5Ea), as primers in PCR (Sec-

tion 5-5F), and to carry out site-directed mutagenesis

(Section 5-5Gc).

A. Synthetic Procedures

The basic strategy of oligonucleotide synthesis is analo-

gous to that of polypeptide synthesis (Section 7-5A): A

suitably protected nucleotide is coupled to the growing end

of the oligonucleotide chain, the protecting group is re-

moved, and the process is repeated until the desired

oligonucleotide has been synthesized. The first practical

technique for DNA synthesis, the phosphodiester method,

which was developed by H. Gobind Khorana in the 1960s,

is a laborious process in which all reactions are carried

out in solution and the products must be isolated at each

stage of the multistep synthesis. Khorana, nevertheless,

Section 7-6. Chemical Synthesis of Oligonucleotides 209

Figure 7-37 The native chemical ligation reaction. Peptide-1

has a C-terminal thioester group (R is an alkyl group) and

Peptide-2 has a N-terminal Cys residue. The reaction, which

occurs in aqueous solution at pH 7, is initiated by the nucleophilic

attack of Peptide-2’s Cys thiol group on Peptide-1’s thioester

group to yield, in a thiol exchange reaction, a new thioester



COO



Peptide-2C

Peptide-1



COO



Peptide-2C

Peptide-1

COO



 H



Peptide-1 Peptide-2H





SR  H

group. This intermediate (as is indicated by the square brackets),

undergoes a rapid intramolecular nucleophilic attack to yield a

native peptide bond at the ligation site. [After Dawson, P.E.,

Muir,T.W., Clark-Lewis, I., and Kent, S.B.H., Science 266, 777

(1994).]

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 209

210 Chapter 7. Covalent Structures of Proteins and Nucleic Acids

Figure 7-38 Reaction cycle in the phosphoramidite method of

oligonucleotide synthesis. Here B

, and B

represent

protected bases, and S represents an inert solid phase support

such as controlled-pore glass.

DMTr

O S

DMTr

Detritylation1.

O S

DMTr

Coupling2.

Acetic

anhydride

CCH

O S

Capped failure sequence

(no further extension)

O P N(C

)

Tetrazole

O S

DMTr O

Oxidation4.

/ H

O S

DMTr

OCH

DMTr

Dimethoxytrityl

R : N CH

C CH

β-Cyanoethyl

Capping of

unreacted 5' end

HN(C

)

JWCL281_c07_163-220.qxd 2/22/10 9:11 PM Page 210