linguistics [100]. Although their theory is a questionable oversimplification of
music, it includes, among other rules, the metrical structure, as in our representation.
However, in addition to pitch and rhythm, we introduce the notion of hierarchi-
cal timbre structure, a perceptually grounded description of music audio based
on the metrical organization of its timbral surface: the perceived spectral shape
in time, as described in section 3.3.
4.2.2 Global timbre methods
Global timbre analysis has received much attention in recent years as a means
to measure the similarity of songs [11][6][69]. Typically, the estimation is built
upon a pattern-recognition architecture. The algorithm aims at capturing the
overall timbre distance between songs, given a large set of short audio frames, a
small set of acoustic features, a statistical model of their distribution, and a
distance function. It was shown, however, by Aucouturier and Pachet, that these
generic approaches quickly lead to a “glass ceiling,” at about 65% R-precision
[7]. They conclude that substantial improvements would most likely rely on a
“deeper understanding of the cognitive processes underlying the perception of
complex polyphonic timbres, and the assessment of their similarity.”
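As a rough illustration of the pattern-recognition architecture described above, the following sketch summarizes each song by a single diagonal-covariance Gaussian fitted to its frame-level feature vectors, and compares two songs with a symmetrized Kullback-Leibler divergence. The single Gaussian, the diagonal covariance, the toy feature values, and the names `fit_gaussian`, `kl_diag`, and `symmetric_kl` are assumptions made for brevity; the cited systems may rely on other features, models, and distance functions.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian(frames, floor=1e-9):
    """Fit a diagonal Gaussian to a list of equal-length feature vectors."""
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    # Floor the variances so that the divergence below stays finite.
    variances = [max(pvariance(d, mu), floor) for d, mu in zip(dims, means)]
    return means, variances

def kl_diag(p, q):
    """Kullback-Leibler divergence KL(p || q) between two diagonal Gaussians."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(
        math.log(b / a) + (a + (m - n) ** 2) / b - 1.0
        for m, a, n, b in zip(mp, vp, mq, vq)
    )

def symmetric_kl(p, q):
    """Symmetrized divergence, used here as a song-to-song timbre distance."""
    return kl_diag(p, q) + kl_diag(q, p)

# Toy frame-level features (hypothetical values), one list of frames per song.
song_a = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
song_b = [[1.0, 0.0], [3.0, 1.0], [5.0, 4.0]]
distance = symmetric_kl(fit_gaussian(song_a), fit_gaussian(song_b))
```

In this simplified form, the temporal order of the frames is discarded entirely, which is one way to read the term “global” in this context.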
It is indeed unclear how humans perceive the superposition of sounds, what
“global” means, and how much more significant global similarity actually is than
“local” similarities. Comparing the most salient segments, or patterns in songs,
may perhaps lead to more meaningful strategies.
4.2.3 Rhythmic similarities
A similarity analysis of rhythmic patterns is proposed by Paulus and Klapuri
[130]. Their method, tested only with drum sounds, consists of aligning tempo-
variant patterns via dynamic time warping (DTW), and comparing their nor-
malized spectral centroid, weighted with the log-energy of the signal. They
point out that aligning the patterns is the most difficult task.
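The alignment step mentioned above can be sketched with a textbook dynamic time warping recursion. This is a minimal sketch, not the procedure of [130]: it assumes one-dimensional feature sequences (for instance, one normalized spectral centroid value per frame), uses the absolute difference as local cost, and omits the log-energy weighting. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment of sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Because a frame may be repeated on either side, a pattern and a time-stretched copy of it can align at zero cost, which is what makes the comparison tolerant to tempo variations.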
Ellis and Arroyo [46] present an approach to rhythm similarity called “eigen-
rhythm” using Principal Component Analysis (PCA) of MIDI drum patterns
from 100 popular songs. First, the length and downbeat of input patterns are
estimated via autocorrelation and by alignment to an average “reference” pat-
tern. Then, through PCA, they reduce the high-dimensional data
to a small space of combined “basis” patterns that can be used for classification
and visualization.
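The dimensionality reduction described above can be illustrated with a generic PCA computed through a singular value decomposition. This is only a sketch under stated assumptions: the four eight-step patterns are made-up binary onset grids, not data from [46], the patterns are assumed to be already aligned and of equal length, and the names `pca_basis` and `reconstruct` are hypothetical.

```python
import numpy as np

def pca_basis(patterns, n_components):
    """Return the mean pattern, the top principal axes, and per-pattern coordinates."""
    X = np.asarray(patterns, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are orthonormal principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    coords = centered @ basis.T  # low-dimensional coordinates of each pattern
    return mean, basis, coords

def reconstruct(mean, basis, coords):
    """Approximate the original patterns from their low-dimensional coordinates."""
    return mean + coords @ basis

# Hypothetical drum patterns: one row per pattern, one column per time step.
patterns = [
    [1, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 0],
]
mean, basis, coords = pca_basis(patterns, n_components=2)
```

Each row of `coords` places a pattern in the reduced space, where it could be plotted or passed to a classifier; each row of `basis` plays the role of a “basis” pattern.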
In [129], Parry and Essa extend to audio the notion of rhythmic elaboration, first
proposed by Tanguiane [158] for symbolic music. They divide the amplitude
envelope via beat tracking, and measure the pattern length as in [130]. This