Let us begin with a monophonic audio signal of arbitrary length and sound
quality. Since we are only concerned with the human appreciation of music,
the signal may have been formerly compressed, filtered, or resampled. The
music can be of any kind: we have tested our system with excerpts taken from
jazz, classical, funk, electronic, rock, pop, folk and traditional music, as well as
speech, environmental sounds, and drum loops.
3.1 Auditory Spectrogram
The goal of our auditory spectrogram is to convert the time-domain waveform
into a reduced, yet perceptually meaningful, time-frequency representation. We
seek to remove the information that is the least critical to our hearing sensation
while retaining the most important parts, therefore reducing signal complexity
without perceptual loss. The MPEG-1 Audio Layer 3 (MP3) codec [18] is a good
example of an application that exploits this principle for compression purposes.
Our primary interest here is understanding our perception of the signal rather
than resynthesizing it; the reduction process is therefore sometimes simplified,
but also extended and fully parametric, in comparison with usual perceptual
audio coders.
3.1.1 Spectral representation
First, we apply a Short-Time Fourier Transform (STFT) to obtain a
standard spectrogram. We experimented with many window types and sizes,
which did not have a significant impact on the final results. However, since we
are mostly concerned with timing accuracy, we favor short windows (e.g., 12-ms
Hanning), which we compute every 3–6 ms (i.e., every 128–256 samples at 44.1
kHz). The Fast Fourier Transform (FFT) is zero-padded up to 46 ms to gain
additional interpolated frequency bins. We calculate the power spectrum and
scale its amplitude axis to decibels (dB SPL, a measure of sound pressure level)
as in the following equation:
$$ I_i(\mathrm{dB}) = 20 \log_{10}\left(\frac{I_i}{I_0}\right) \qquad (3.1) $$
where $i > 0$ is an index of the power-spectrum bin of intensity $I$, and $I_0$
is an arbitrary threshold of hearing intensity. For a reasonable tradeoff between
dynamic range and resolution, we choose $I_0 = 60$, and we clip sound pressure
levels below -60 dB. The threshold of hearing is in fact frequency-dependent
and is a consequence of the outer and middle ear response.
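
The spectral front end described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function name `db_spectrogram` and all of its parameter names are hypothetical, the Hanning window is taken from `numpy.hanning`, the hop is fixed at 128 samples, and the FFT length of 2048 samples (about 46 ms at 44.1 kHz) is one way to realize the zero-padding. The reference intensity is left as a free parameter `i0` rather than fixed to the value quoted in the text, and the -60 dB clip is applied as a simple floor on the output; both choices are interpretations.

```python
import numpy as np

def db_spectrogram(x, sr=44100, win_ms=12.0, hop=128, nfft=2048,
                   i0=1.0, floor_db=-60.0):
    """Hypothetical helper: short-window STFT power spectrogram in dB.

    x        : 1-D mono signal (at least one window long)
    win_ms   : analysis window length in milliseconds (12 ms in the text)
    hop      : hop size in samples (128-256 in the text)
    nfft     : zero-padded FFT length (2048 samples is about 46 ms at 44.1 kHz)
    i0       : reference intensity I_0 of Eq. 3.1 (assumed free parameter)
    floor_db : levels below this value are clipped (assumed -60 dB floor)
    """
    x = np.asarray(x, dtype=float)
    win_len = int(round(sr * win_ms / 1000.0))   # 529 samples at 44.1 kHz
    window = np.hanning(win_len)

    # Slice the signal into overlapping frames, one row per frame.
    n_frames = 1 + (len(x) - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * window

    # rfft zero-pads each windowed frame up to nfft samples,
    # which yields interpolated frequency bins.
    spectrum = np.fft.rfft(frames, n=nfft, axis=1)
    power = np.abs(spectrum) ** 2

    # Keep bins with index i > 0 only (drop the DC bin).
    power = power[:, 1:]

    # Eq. 3.1 applied bin by bin; the tiny constant avoids log10(0).
    tiny = np.finfo(float).tiny
    level_db = 20.0 * np.log10(np.maximum(power, tiny) / i0)

    # Clip everything below the floor.
    return np.maximum(level_db, floor_db)
```

Calling `db_spectrogram(signal)` on a mono array sampled at 44.1 kHz returns a two-dimensional array with one row per analysis frame and one column per frequency bin above DC. A larger `hop` trades timing resolution for fewer frames, and a larger `nfft` only adds interpolated bins without improving the true frequency resolution, which is set by the 12-ms window.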