
models of sound generation and propagation along the
vocal tract. A somewhat comprehensive review of this
method is given in [3]. Due to high computational
requirements and the need for highly accurate modeling,
articulatory synthesis is mostly useful for research
in speech production. It usually delivers unacceptably
low-quality synthetic speech.
One level higher in abstraction, and much more
practical in its use, is formant synthesis. This method
captures the characteristics of the resonances of
the human vocal tract in terms of simple filters. The
single-peaked frequency characteristic of such a filter
element is called a formant. Its frequency, bandwidth
(narrow to broad), and amplitude fully specify each
formant. For adult vocal tracts, four to five formants
are enough to determine their acoustic filter characteristics.
The phonetically most relevant are the lowest three
formants that span the vowel and sonorant space of a
speaker and a language. Together with a suitable wave-
form generator that approximates the glottal pulse,
formant synthesis systems, due to their highly versatile
control parameter sets, are very useful for speech per-
ception research. More on formant synthesis can be
found in [4]. For use as a speech synthesizer, the
computational requirements are relatively low, making
this method the preferred option for embedded appli-
cations, such as reading back names (e.g., ‘‘calling
Mom’’) in a dial-by-voice cellular phone handset. Its
storage requirements are minuscule (as little as 1 MB).
Formant synthesis delivers intelligible speech when
special care is given to consonants.
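As a rough illustration of the filter model described above, each formant can be realized as a two-pole digital resonator whose pole angle is set by the formant frequency and whose pole radius is set by the bandwidth. The following sketch is not from the source; the formant values, bandwidths, and impulse-train excitation are illustrative assumptions only:

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs=16000):
    """Two-pole coefficients for one formant resonator.

    Pole radius follows from the bandwidth, pole angle from the
    formant frequency; b0 normalizes the filter to unity gain at DC.
    """
    r = math.exp(-math.pi * bw_hz / fs)
    theta = 2.0 * math.pi * freq_hz / fs
    # Difference equation: y[n] = b0*x[n] - a1*y[n-1] - a2*y[n-2]
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    b0 = 1.0 + a1 + a2
    return b0, a1, a2

def filter_signal(x, coeffs):
    """Run a signal through one second-order resonator section."""
    b0, a1, a2 = coeffs
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = b0 * xn - a1 * y1 - a2 * y2
        y.append(yn)
        y1, y2 = yn, y1
    return y

# Crude glottal-like excitation: an impulse train at a 100 Hz pitch.
fs = 16000
excitation = [1.0 if n % (fs // 100) == 0 else 0.0
              for n in range(fs // 10)]

# Cascade three formant resonators (illustrative /a/-like values).
signal = excitation
for f, bw in [(700, 130), (1220, 70), (2600, 160)]:
    signal = filter_signal(signal, resonator_coeffs(f, bw, fs))
```

A production formant synthesizer would add amplitude controls per formant, a more realistic glottal-pulse model, and noise sources for consonants; this sketch only shows the resonator cascade itself.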
In the 1970s, a new method started to compete
with the, by then, well-established formant synthesis
method. Due to its main feature of stitching together
recorded snippets of natural speech, it was called con-
catenative synthesis. Many different options exist for
selecting the specific kind of elementary speech units
to concatenate. Using words as such units, although
intuitive, is not a good choice given that there are many
tens of thousands of them in a language and that each
recorded word would have to fit into several different
contexts with its neighbors, creating the need to record
several versions of each word. Therefore, word-based
concatenation usually sounds very choppy and artificial.
Subword units, however, such as diphones or
demisyllables, turned out to be much more useful
because of favorable statistics. For English, there is a
minimum of about 1500
▶ diphones that would need
to be in the inventory of a diphone-based
concatenative synthesizer. The number is only slightly
higher for concatenating
▶ demisyllables. For both
kinds of units, however, elaborate methods are needed
to identify the best single (or few) instances of units to
store in the voice inventory, based on statistical mea-
sures of acoustic typicality and ease of concatenation,
with a minimum of audible glitches. In addition, at
synthesis time, elaborate speech signal processing is
needed to assure smooth transitions, deliver the de-
sired prosody, etc. For more details on this method, see
[5]. Concatenative synthesis, like formant synthesis,
delivers highly intelligible speech and usually has no
problem with transients like stop consonants, but usu-
ally lacks naturalness and thus cannot match the qual-
ity of direct human voice recordings. Its storage
requirements are moderate by today’s standards
(10–100 MB).
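To make the diphone unit concrete, a phone sequence (with silence at both ends) decomposes into overlapping phone-to-phone transitions. This small helper is an illustrative sketch, not part of the source; the `#` silence marker follows the notation used later in this entry:

```python
def diphones(phones):
    """Split a phone sequence into diphone units, using '#' for
    the silence that precedes and follows the utterance."""
    seq = ['#'] + phones + ['#']
    # Each adjacent pair of phones forms one diphone unit.
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

diphones(['t', 'uw'])  # → ['#-t', 't-uw', 'uw-#']
```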
Unit Selection Synthesis
The effort and care given to creating the voice inventory
determines, to a large extent, the quality of any concatenative
synthesizer. For best results, most concatenative synthesis
researchers well into the 1990s employed a
largely manual off-line process of trial and error that
relied on dedicated experts. A selected unit needed to fit
all possible contexts (or be made to fit by signal processing
such as stretching or shrinking durations, pitch scaling,
etc.). However, morphing any given unit by signal processing
in the synthesizer at synthesis time degrades voice
quality. So the idea was born to minimize the use of signal
processing by taking advantage of the ever-increasing
power of computers to handle ever-increasing data sets.
Instead of outright morphing a unit to make it fit, the
synthesizer may try to pick a suitable unit from a large
number of available candidates, optionally followed by
much more moderate signal processing. The objective
is to automatically find the optimal sequence of unit
instances at synthesis time, given a large inventory of
unit candidates and the sentence to be synthesized.
This new objective turned the speech synthesis
problem into a rapid search problem [6].
The process of selecting the right units in the in-
ventory that instantiate a given input text, appropri-
ately called unit selection, is outlined in Fig. 1. Here,
the word ‘‘two’’ (or ‘‘to’’) is synthesized using
diphone candidates for silence into ‘‘t’’ (/#-t/), ‘‘t’’
into ‘‘uw’’ (/t-uw/), and ‘‘uw’’ into silence (/uw-#/).
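The search behind unit selection is commonly cast as a Viterbi-style dynamic program over the candidate lattice: each candidate carries a target cost (how well it matches the specification) and each junction a join cost (how smoothly two units concatenate). The sketch below illustrates this idea under assumed cost functions; the toy inventory and the numeric scores attached to each unit are hypothetical, not from the source:

```python
def unit_selection(candidates, target_cost, join_cost):
    """Viterbi-style search for the cheapest sequence of units.

    candidates: one list of candidate units per position.
    target_cost(u): mismatch of unit u against the specification.
    join_cost(a, b): cost of concatenating unit a into unit b.
    """
    # trellis[pos][j] = (best total cost ending in candidates[pos][j],
    #                    index of the predecessor at pos - 1)
    trellis = [[(target_cost(u), None) for u in candidates[0]]]
    for pos in range(1, len(candidates)):
        column = []
        for u in candidates[pos]:
            cost, back = min(
                (trellis[-1][j][0] + join_cost(prev, u), j)
                for j, prev in enumerate(candidates[pos - 1])
            )
            column.append((cost + target_cost(u), back))
        trellis.append(column)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(trellis[-1])), key=lambda k: trellis[-1][k][0])
    path = []
    for pos in range(len(candidates) - 1, 0, -1):
        path.append(candidates[pos][j])
        j = trellis[pos][j][1]
    path.append(candidates[0][j])
    path.reverse()
    return path

# Toy inventory for ‘‘two’’: two recorded instances per diphone,
# each tagged with a made-up acoustic-mismatch score.
inventory = [
    [('#-t', 0.1), ('#-t', 0.5)],
    [('t-uw', 0.3), ('t-uw', 0.2)],
    [('uw-#', 0.4), ('uw-#', 0.1)],
]
best = unit_selection(
    inventory,
    target_cost=lambda u: u[1],
    join_cost=lambda a, b: abs(a[1] - b[1]),
)
```

Real systems derive both costs from statistical acoustic models rather than hand-attached scores, but the trellis search itself has this shape.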