marks often take the form of 200- to 300-bp segments known
as sequence-tagged sites (STSs), whose exact sequence oc-
curs nowhere else in the genome. Hence, two clones that
contain the same STS must overlap. The STS-containing inserts
are then randomly fragmented (usually by sonication;
Section 5-3D) into ~40-kb segments that are subcloned into
cosmid vectors so that a high resolution map can be con-
structed by identifying their landmark overlaps. The cosmid
inserts are then randomly fragmented into overlapping 5- to
10-kb or 1-kb segments for insertion into plasmid or M13
vectors (shotgun cloning; Section 5-5E). These inserts (~800
M13 clones per cosmid) are then sequenced (~400 bp per
clone) and the resulting so-called reads are assembled
computationally into contigs to yield the sequence of their
parent cosmid insert (with a redundancy of 400 bp per clone ×
800 clones per cosmid/40,000 bp per cosmid = 8). Finally, the
cosmid inserts are assembled, through cosmid walking (the
computational analog of chromosome walking; Fig. 5-51),
using their landmark overlaps (with landmarks ideally
spaced at intervals of 100 kb or less), to yield the sequences
of the YAC inserts, which are then assembled, using their
STSs, to yield the chromosome’s sequence.
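The two pieces of reasoning in this paragraph, that clones sharing an STS must overlap and that ~800 reads of ~400 bp cover a 40,000-bp cosmid insert about 8-fold, can be illustrated with a minimal sketch. The clone and STS names are hypothetical, and the function names are chosen here for illustration only.

```python
from itertools import combinations


def overlapping_clones(sts_content):
    """Return the pairs of clones that share at least one STS.

    sts_content maps a clone name to the set of STS names found in it.
    Because each STS occurs only once in the genome, any shared STS
    implies that the two inserts overlap.
    """
    pairs = {}
    for a, b in combinations(sorted(sts_content), 2):
        shared = sts_content[a] & sts_content[b]
        if shared:
            pairs[(a, b)] = shared
    return pairs


def coverage_redundancy(read_length, num_reads, insert_length):
    """Average number of times each base of the insert is sequenced."""
    return read_length * num_reads / insert_length


# Hypothetical STS content of three clones.
clones = {
    "cloneA": {"STS1", "STS2"},
    "cloneB": {"STS2", "STS3"},
    "cloneC": {"STS4"},
}
overlaps = overlapping_clones(clones)

# Figures quoted in the text: ~400 bp per clone, ~800 clones, 40,000-bp insert.
redundancy = coverage_redundancy(400, 800, 40_000)
```

Here only cloneA and cloneB are reported as overlapping (through STS2), and the redundancy evaluates to 8, in agreement with the text.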
The genomes of most complex eukaryotes contain nu-
merous tracts of repetitive sequences, that is, segments of
DNA that are tandemly repeated hundreds, thousands, and
in some cases millions of times (Section 34-2B). Lengthy
tracts of repetitive sequences easily confound the forego-
ing assembly process, leading to gaps in the sequence.
Moreover, such repetitive sequences greatly exacerbate
the difficulty of finding properly spaced STSs. To partially
circumvent the latter difficulty, STS-like sequences of
cDNAs, known as expressed sequence tags (ESTs), are
used in place of STSs. Since the mRNAs from which
cDNAs are reverse transcribed encode proteins, they are
unlikely to contain repetitive sequences.
b. The Whole Genome Shotgun Assembly Strategy
Although the initial goal of the human genome project
of identifying STSs and ESTs every ~100 kb in the human
genome was achieved, advances in computational and
cloning technology permitted a more straightforward se-
quencing procedure that eliminates the need for both the
low resolution (YAC) and high resolution (cosmid) map-
ping steps. In this so-called whole genome shotgun assem-
bly (WGSA) strategy, which was formulated by Venter,
Hamilton Smith, and Leroy Hood, a genome is randomly
fragmented, a large number of cloned fragments are sequenced,
and the genome is assembled by identifying overlaps
between pairs of fragments. Statistical considerations
indicate that, using this strategy, the probability that a
given base is not sequenced is ideally $e^{-c}$, where c is the
redundancy of coverage [$c = LN/G$, where L is the average
length of the reads in nucleotides (nt), N is the number of
reads, and G is the length of the genome in nt], the aggregate
length of the gaps between contigs is $Ge^{-c}$, and the
average gap size is G/N. Moreover, without a long-range
physical and/or genetic map of the genome being se-
quenced, the order of the contigs and their relative orienta-
tions would be unknown.
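The idealized statistics quoted above can be evaluated directly. The following is a minimal sketch; the function name and the example values of L, N, and G (a hypothetical 3,000,000-nt genome sequenced with 30,000 reads of 500 nt) are assumptions made for illustration and do not come from the text.

```python
import math


def shotgun_statistics(read_length, num_reads, genome_length):
    """Idealized shotgun statistics: c = LN/G, P(missed) = exp(-c)."""
    c = read_length * num_reads / genome_length
    p_unsequenced = math.exp(-c)
    return {
        "coverage": c,                                     # c = LN/G
        "p_unsequenced": p_unsequenced,                    # e^(-c)
        "total_gap_length": genome_length * p_unsequenced, # G e^(-c)
        "average_gap_size": genome_length / num_reads,     # G/N
    }


# Hypothetical example: L = 500 nt, N = 30,000 reads, G = 3,000,000 nt.
stats = shotgun_statistics(read_length=500, num_reads=30_000,
                           genome_length=3_000_000)
```

With these assumed values the coverage is 5-fold, roughly 0.7% of the bases (about 20,000 nt in total) are expected to remain unsequenced, and the average gap is 100 nt.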
For bacterial genomes, the WGSA strategy is carried
out straightforwardly by sequencing tens of thousands of
fragments and assembling them (a task that required the
development of computer algorithms capable of assem-
bling contigs from very large numbers of reads). Then, in a
task known as finishing, the gaps between contigs are filled
in by several techniques including synthesizing PCR
primers complementary to the ends of the contigs and us-
ing them to isolate the missing segments (chromosome
walking; bacterial genomes have few if any repetitive se-
quences).
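To suggest how reads can be assembled into contigs by identifying overlaps, here is a toy greedy sketch. It is not the algorithm used by any actual genome assembler: it assumes error-free reads that all come from the same strand, ignores reads wholly contained within other reads, and compares every pair of sequences, which would be far too slow for real data. The function names and the example reads are hypothetical.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that is a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences with the longest overlap.

    Returns the list of contigs left when no pair overlaps by at least
    `min_overlap` bases; more than one contig means gaps remain.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [seq for k, seq in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three hypothetical overlapping reads, given in scrambled order.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
contigs = greedy_assemble(reads)
```

For these reads the procedure yields the single contig ATGCGTACGTTAGC. A tract of repeated sequence longer than a read would make several merges equally plausible, which is one way to see why repetitive sequences confound assembly.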
For eukaryotic genomes, their much greater sizes re-
quire that the WGSA strategy be carried out in stages as
follows (Fig. 7-16b). A bacterial artificial chromosome
(BAC) library of ~150-kb inserts is generated (for the human
genome, an ~15-fold redundancy, which would still
leave ~900 bases unsequenced, would require ~300,000
such clones; BACs are used because they are subject to
fewer technical difficulties than are YACs). The insert in
each of these BAC clones is identified by sequencing
~500 bp in from each end to yield segments known as sequence-tagged
connectors (STCs or BAC-ends, which for
the above 300,000 clones would collectively comprise
~300,000 kb, that is, 10% of the entire human genome).
One BAC insert is then fragmented and shotgun cloned
into plasmid or M13 vectors (so as to yield ~3000 overlapping
clones), and the fragments are sequenced and assembled
into contigs. The sequence of this “seed” BAC is then
compared with the database of STCs to identify the ~30
overlapping BAC clones. The two with minimal overlap at
either end are then selected, sequenced, and the operation
repeated until the entire chromosome is sequenced (BAC
walking), which for the human genome required 27 million
sequencing reads. This process is also confounded by repetitive
sequences.
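The figures quoted for the BAC library follow from the coverage relationships given earlier, and one step of BAC walking can be sketched in a few lines. The genome size of 3,000,000 kb is the value implied by the text's statement that 300,000 kb is 10% of the human genome. The function `next_bac`, the toy sequences, and the simplification that each BAC is represented by a single STC in the same orientation as the seed are assumptions made for illustration.

```python
import math

GENOME_BP = 3_000_000_000  # human genome size implied by the text
INSERT_BP = 150_000        # BAC insert length
REDUNDANCY = 15            # fold coverage of the BAC library
STC_BP = 500               # bases sequenced in from each BAC end

clones_needed = REDUNDANCY * GENOME_BP // INSERT_BP  # number of BAC clones
unsequenced = GENOME_BP * math.exp(-REDUNDANCY)      # bases left uncovered
stc_total_bp = clones_needed * 2 * STC_BP            # two ends per clone
stc_fraction = stc_total_bp / GENOME_BP              # share of the genome


def next_bac(seed_sequence, stcs):
    """Choose the BAC that overlaps the seed's right end the least.

    stcs maps a BAC name to the STC at the end of that BAC nearest the
    seed. The BAC whose STC lies farthest to the right in the seed shares
    the least sequence with it and so extends the walk the farthest.
    Returns None if no STC is found in the seed.
    """
    positions = {name: seed_sequence.rfind(tag) for name, tag in stcs.items()}
    positions = {name: pos for name, pos in positions.items() if pos >= 0}
    if not positions:
        return None
    return max(positions, key=positions.get)


# Toy example with hypothetical sequences.
seed = "AAACCCGGGTTT"
candidates = {"bac1": "CCC", "bac2": "GGT", "bac3": "TTTT"}
chosen = next_bac(seed, candidates)
```

The arithmetic reproduces the text's values of ~300,000 clones, ~900 unsequenced bases, and STCs amounting to 10% of the genome. In the toy example bac2 is chosen because its tag lies farther to the right in the seed than that of bac1, whereas the tag of bac3 does not occur in the seed at all.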
The WGSA strategy is readily automated through ro-
botics and hence is faster and less expensive than the map-
based strategy. Indeed, most known genome sequences
have been determined using the WGSA strategy, many in a
matter of a few months, and its advent reduced the time to
sequence the human genome by several years. Neverthe-
less, it appears that for eukaryotic genomes, most of the
residual errors in a WGSA-based genome sequence
[mainly the failure to recognize long (>15 kb) segments
that have nearly (>97%) identical sequences] can be eliminated
by finishing it through the use of some of the techniques
of the map-based strategy.
c. The Human Genome Has Been Sequenced
The “rough draft” of the human genome was reported
in 2001 by two independent groups: the publicly funded
International Human Genome Sequencing Consortium
(IHGSC; a collaboration involving 20 sequencing centers in
six countries), led by Francis Collins, Eric Lander, and John
Sulston, which used the map-based strategy; and a privately
funded group, mainly from Celera Genomics, led by Venter,
which used the WGSA strategy. The IHGSC-determined
genome sequence was a conglomerate from numerous
anonymous individuals, whereas that from Celera Genomics