Chapter 10: A Formula of Alberto Calderón in Speaker Identification

10.1 Introduction
The project concerned the very concrete problem of speaker identification.
This usually concerns a situation where speech fragments of a number of
speakers have been stored; when a new speech fragment is presented, the
speaker identification system should be able to recognize with a reasonably
high degree of accuracy whether or not this is one of the previously sampled
speakers, and, if so, who it is.
Ideally, this should work even if the specific
utterance in the piece of speech under scrutiny is different from any that
were encountered before. The problem is thus to identify, and later to detect,
reliable parameters that characterize the speaker independently of the
utterance. There exist various approaches that perform very well on "clean"
speech, that is, when both the previously stored samples and the speech
fragment for which the speaker has to be identified have very low noise
levels. Most models break down at noise levels far below those where our own
auditory recognition system starts to fail. Because of the connection of the
wavelet transform with the auditory system, and because there existed other
indications that an auditory-system-based approach might be more robust
than existing methods, we decided to construct a wavelet-based approach to
this problem.
This chapter is organized as follows. Sections 10.2 to 10.4 present
background material, explaining respectively (1) how the (continuous) wavelet
transform, which is essentially the same as a decomposition formula proposed
by A. Calderón in the early sixties (see (10.2) below), comes up "naturally"
in our auditory system, (2) a heuristic approach (the ensemble interval
histogram of O. Ghitza [1]) based on auditory nerve models, which eliminates
much of the redundancy in the first-stage transform, and (3) the
modulation model, valid for large portions of (voiced) speech, and which is used for
speaker identification.
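For readers who want the formula in front of them: in one standard normalization (the chapter's own (10.2) may use different conventions), Calderón's reproducing identity writes any $f \in L^2(\mathbb{R})$ as a superposition of shifted and dilated copies of a fixed admissible function $\psi$:
$$
f \;=\; \frac{1}{C_\psi} \int_0^\infty \!\! \int_{-\infty}^{\infty} \langle f, \psi_{a,b}\rangle\, \psi_{a,b}\, \frac{\mathrm{d}b\,\mathrm{d}a}{a^2},
\qquad
\psi_{a,b}(t) \;=\; a^{-1/2}\,\psi\!\left(\frac{t-b}{a}\right),
$$
where $C_\psi = \int_0^\infty |\hat\psi(\xi)|^2\,\mathrm{d}\xi/\xi < \infty$ is the admissibility constant. The coefficients $\langle f, \psi_{a,b}\rangle$ are precisely the continuous wavelet transform of $f$.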
(Note that our descriptions of the auditory system
are very naive and distorted. They are in no way meant as an accurate
description of what is well known to be a very complex system. Rather, they
are snapshots that motivated our mathematical construction further on, and
they should be taken only as such.) In §10.5 we put all this background
material to use in our own synthesis, an approach that we call "squeezing" the
wavelet transform; with an extra refinement this becomes
"synchrosqueezing." The main idea is that the wavelet transform itself has "smeared" out
different harmonic components, and that we need to "refocus" the resulting
time-frequency or time-scale picture. How this is done is explained in §10.5.
Section 10.6 sketches a few implementation issues. Finally, §10.7 shows some
results: the "untreated" wavelet transform of a speech segment, its squeezed
and synchrosqueezed versions, and the extraction of the parameters used
for speaker identification. We conclude with some pointers to and
comparisons with similar work in the literature, and with sketching possible future
directions.
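The "refocusing" idea previewed above can be illustrated numerically. The sketch below is not the chapter's algorithm (that is developed in §10.5); it is a minimal synchrosqueezing demonstration under simple assumptions: a Morlet continuous wavelet transform computed via the FFT, a local frequency estimate obtained from the phase derivative of W(a, b) in b, and reassignment of wavelet magnitude into frequency bins. All function names and parameter choices here are ours, for illustration only.

```python
import numpy as np

def cwt_morlet(x, scales, dt, w0=6.0):
    """Continuous wavelet transform with an analytic Morlet wavelet, via FFT.
    Returns a complex array W[scale, time]."""
    n = len(x)
    xf = np.fft.fft(x)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)
    W = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # Fourier transform of the Morlet wavelet at scale a,
        # restricted to positive frequencies (analytic signal)
        psi_hat = np.pi ** -0.25 * np.exp(-0.5 * (a * omega - w0) ** 2)
        psi_hat = psi_hat * (omega > 0)
        W[i] = np.fft.ifft(xf * np.conj(psi_hat)) * np.sqrt(a)
    return W

def synchrosqueeze(W, scales, dt, nbins=128):
    """Reassign |W| mass from (scale, time) to (instantaneous frequency, time).
    The phase derivative of W along time serves as the local frequency estimate."""
    dW = np.gradient(W, dt, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        inst_freq = np.imag(dW / W) / (2 * np.pi)  # in Hz
    fmax = 0.5 / dt  # Nyquist frequency
    freqs = np.linspace(0.0, fmax, nbins)
    T = np.zeros((nbins, W.shape[1]))
    for i in range(W.shape[0]):
        for t in range(W.shape[1]):
            f = inst_freq[i, t]
            # skip negligible coefficients, whose phase estimate is unreliable
            if np.isfinite(f) and 0.0 <= f < fmax and abs(W[i, t]) > 1e-8:
                k = int(round(f / fmax * (nbins - 1)))
                T[k, t] += abs(W[i, t])
    return freqs, T
```

Applied to a pure tone, the energy that the wavelet transform smears across a band of neighboring scales is collected back into a narrow band of frequency bins, which is exactly the "refocusing" effect that squeezing is meant to achieve.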