
personal recordings. For real-world audio-visual data, a transcript can be generated using automatic speech recognition (ASR), the speakers can be labeled using speaker recognition, and the derived speaker turns and segments can be used to index the associated audio and video.
The general unsupervised speaker segmentation problem, in addition to lacking models or other prior information to help segment the speech data by speaker, brings several further obstacles that complicate the task of separating the segments of one speaker from those of another. For example, multispeaker speech data typically includes several short segments, which are difficult to analyze because of the inherent instability of short analysis windows. In addition, more than one speaker may be talking at the same time, so segments may be contaminated with the speech of another speaker. The accuracy of the segmentation process is also affected by background noise and/or music, which creates the need to model these artifacts and in turn increases system complexity. Other difficulties are related to the dynamic fine-tuning of parameters that improve the accuracy of the segmentation algorithms. Optimizing system performance in terms of access times and signal-processing speed is another major concern. It is highly desirable that these segmentation tasks be accomplished automatically, with minimal user intervention, and that they be performed quickly and accurately.
The task of speaker segmentation can be considered an evolution of Voice Activity Detection (VAD), also referred to as Speech Activity Detection (SAD). VAD constitutes a very basic task for most speech-based technologies (speech coding, ASR, Speaker Recognition (SR), speaker segmentation, voice recording, noise suppression, and others). The classification of an audio recording into speech and nonspeech segments can be utilized to achieve more efficient coding and recognition.
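As a rough illustration of the speech/nonspeech decision, the Python sketch below (assuming only NumPy; the frame sizes and the -35 dB threshold are illustrative choices, not values from this entry) labels each analysis frame by thresholding its short-time energy. Practical VAD systems rely on richer features and adaptive, noise-tracking thresholds.

import numpy as np

def energy_vad(signal, rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Label each frame as speech (True) or nonspeech (False) by
    comparing its short-time energy, in dB relative to the loudest
    frame, against a fixed threshold."""
    frame_len = int(rate * frame_ms / 1000)
    hop_len = int(rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = np.asarray(signal[i * hop_len:i * hop_len + frame_len],
                           dtype=np.float64)
        energy[i] = np.mean(frame ** 2)
    # Energies in dB relative to the maximum-energy frame; the small
    # constants guard against log(0) on silent frames.
    db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    return db > threshold_db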
Grouping together segments from the same speaker, i.e., ▶ speaker clustering, is also a crucial step that follows segmentation. Speaker segmentation followed by speaker clustering is referred to as speaker diarization. Diarization, which has received much attention recently, is the process of automatically splitting an audio recording into speaker segments and determining which segments are uttered by the same speaker. In general, diarization can also encompass speaker verification and speaker identification tasks.
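This two-stage structure can be made concrete with a short Python sketch. Everything here is hypothetical scaffolding rather than a standard implementation: change_points stands in for the output of a segmentation stage, and cluster_fn for a clustering routine such as the agglomerative sketch given later in this entry.

import numpy as np

def diarize(features, change_points, cluster_fn):
    """Cut frame-level features at the detected speaker change points,
    then group the resulting segments by speaker."""
    bounds = [0, *change_points, len(features)]
    segments = [features[bounds[i]:bounds[i + 1]]
                for i in range(len(bounds) - 1)]
    labels = np.empty(len(segments), dtype=int)
    # cluster_fn returns one list of segment indices per speaker.
    for speaker, members in enumerate(cluster_fn(segments)):
        labels[members] = speaker
    return segments, labels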
Speaker clustering also belongs to the pattern classification family. Clustering data into classes is a well-studied technique for statistical data analysis, with applications in many fields, and can in general be defined as unsupervised classification of data, i.e., classification without any a priori knowledge about the classes or the number of classes. In the speaker diarization task, the clustering process should ideally result in a single cluster for every speaker identity. The most common approach is hierarchical agglomerative clustering, which groups together segments from the same speaker [1]. Hierarchical agglomerative clustering typically begins with a large number of clusters, which are merged pairwise until arriving (ideally) at a single cluster per speaker. Since the number of speakers is not known a priori, a threshold on the relative change in cluster distance is used to determine the stopping point (i.e., the number of speakers). Determining the number of speakers can be difficult in applications where some speakers speak only for a very short period of time (e.g., in news sound bites or backchannels in meetings), since they tend to be clustered in with other speakers. Although there are several parameters to tune in a clustering system, the most crucial is the distance function between clusters, which affects the effectiveness of finding small clusters.
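A minimal Python sketch of such a system is given below. To stay self-contained it summarizes each cluster by its mean feature vector and compares clusters with Euclidean distance; production systems use model-based distances such as the Bayesian information criterion or the generalized likelihood ratio, and the factor of 1.5 on the relative change in merge distance is an illustrative choice, not a recommended setting.

import numpy as np

def cluster_segments(features, rel_change_thresh=1.5):
    """Hierarchical agglomerative clustering of speaker segments.

    features: list of (n_frames, dim) arrays of acoustic features
    (e.g., MFCCs), one per segment. Returns a list of clusters,
    each a list of segment indices, ideally one per speaker."""
    members = [[i] for i in range(len(features))]
    frames = [np.asarray(f, dtype=np.float64) for f in features]
    prev_dist = None
    while len(members) > 1:
        # Find the closest pair of clusters under the chosen distance.
        best_a, best_b, best_d = 0, 1, np.inf
        for a in range(len(frames)):
            for b in range(a + 1, len(frames)):
                d = np.linalg.norm(frames[a].mean(axis=0)
                                   - frames[b].mean(axis=0))
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # Stop when the cheapest merge is much costlier than the last
        # one: the threshold on relative change in cluster distance.
        if prev_dist is not None and best_d > rel_change_thresh * prev_dist:
            break
        prev_dist = best_d
        members[best_a].extend(members[best_b])
        frames[best_a] = np.vstack([frames[best_a], frames[best_b]])
        del members[best_b], frames[best_b]
    return members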
Examples of Efforts to Foster Speaker Segmentation Research
The Defense Advanced Research Projects Agency (DARPA) and the U.S. National Science Foundation have promoted research in speech technologies for a wide range of tasks since the late 1980s. Additionally, there are significant speech research programs elsewhere in the world, such as projects funded by the European Union.
The Information Technology Laboratory (ITL) of
the National Institute of Standards and Technology
(NIST) has the broad mission of supporting U.S. industry, government, and academia by promoting U.S.
innovation and industrial competitiveness through
advancement of information technology measurement
science, standards, and technology in ways that enhance
economic security and improve our quality of life.
Since 1996, the NIST Speech Group, collaborating with several other government agencies and research institutions, has contributed to the advancement of the state of the art in human language technologies and