ture. The technology enables, for example, cell-phone carriers or copyright
management services to automatically identify audio by comparing
unique “fingerprints” extracted live from the audio with fingerprints in a
specially compiled music database running on a central server [23].
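As a schematic illustration (not the specific system of [23]), the sketch below hashes pairs of nearby spectral peaks into compact fingerprints and identifies a query by counting hash collisions against a database; the peak picking, fan-out parameter, and database contents are all hypothetical simplifications.

```python
import numpy as np

def landmarks(spec, n_peaks=30):
    """Pick the strongest time-frequency peaks of a (time x freq)
    magnitude spectrogram, a toy stand-in for robust peak picking."""
    t, f = np.unravel_index(np.argsort(spec, axis=None)[-n_peaks:], spec.shape)
    return sorted(zip(t.tolist(), f.tolist()))

def fingerprint(spec, fan_out=5):
    """Hash pairs of nearby peaks into (freq1, freq2, time-delta) triples."""
    peaks = landmarks(spec)
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.add((f1, f2, t2 - t1))
    return hashes

def identify(query_spec, database):
    """Rank database entries by the number of shared hashes."""
    q = fingerprint(query_spec)
    scores = {name: len(q & fp) for name, fp in database.items()}
    return max(scores, key=scores.get), scores
```

Because the hashes describe relations between peaks rather than raw samples, matching of this kind degrades gracefully under noise and compression, which is what makes identification over a phone channel feasible.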
Query by description consists of querying a large MIDI or audio database
by providing qualitative text descriptors of the music, or by “humming”
the tune of a song into a microphone (query by humming). The system
typically compares the query against a pre-analyzed database using a
distance metric, and ranks the results by similarity [171][54][26].
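For the melodic case, one common choice of metric (among many in the literature) is dynamic time warping over pitch contours, which tolerates the tempo fluctuations of a hummed query; the sketch below assumes hypothetical contours expressed in semitones.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two pitch contours
    (in semitones), tolerant to local tempo differences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def rank_by_similarity(query, melodies):
    """Return (title, distance) pairs, most similar melody first."""
    return sorted(((t, dtw_distance(query, m)) for t, m in melodies.items()),
                  key=lambda x: x[1])
```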
Music similarity is an attempt at estimating the closeness of music signals.
There are many criteria with which we may estimate similarities, including
editorial (title, artist, country), cultural (genre, subjective qualifiers),
symbolic (melody, harmony, structure), perceptual (energy, texture,
beat), and even cognitive (experience, reference) [167][6][69][9].
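As a deliberately simplified example of a perceptual criterion, the sketch below reduces each track to the mean and deviation of its MFCCs (a common timbral representation) and compares tracks by cosine similarity; the use of librosa and these particular summary statistics are illustrative choices, not a prescribed method.

```python
import numpy as np
import librosa

def timbre_vector(path, n_mfcc=13):
    """Summarize a track by the mean and standard deviation of its
    MFCCs, a crude stand-in for richer perceptual models."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def similarity(v1, v2):
    """Cosine similarity between two timbre vectors (1.0 = identical)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```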
Classification tasks integrate similarity technologies as a way to cluster
music into a finite set of classes, such as genre, artist, rhythm, or
instrument [105][163][47]. Similarity and classification applications often
face the fundamental question of defining a ground truth: a set of labels
taken as fact against which the results can be evaluated.
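In its simplest form, such a classifier is nearest-neighbor search in a feature space like the one sketched above; the example below uses scikit-learn with random stand-in data, and the label vector y is exactly the ground truth whose definition is at issue.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# X: one feature vector per track (e.g., the 26-dim timbre vectors above);
# y: ground-truth labels such as genre. Both are hypothetical here.
X = np.random.rand(200, 26)
y = np.random.choice(["rock", "jazz", "classical"], size=200)

clf = KNeighborsClassifier(n_neighbors=5)
# Cross-validated accuracy is only meaningful if the ground truth is sound.
print(cross_val_score(clf, X, y, cv=5).mean())
```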
Thumbnailing consists of building the most “representative” audio summary
of a piece of music, for instance by removing the most redundant and
least salient sections from it. The task is to detect the boundaries and
similarities of large musical structures, such as verses and choruses, and
finally assemble them into a coherent summary [132][59][27].
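A common starting point for finding such structure is a self-similarity matrix over frame-level features, in which repeated sections appear as off-diagonal stripes; the sketch below (chroma features via librosa and a naive "most-repeated window" heuristic) shows only the skeleton of the idea, and real systems add considerably more machinery.

```python
import numpy as np
import librosa

def self_similarity(path):
    """Chroma-based self-similarity matrix: bright off-diagonal stripes
    indicate repeated sections such as choruses."""
    y, sr = librosa.load(path)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-9)
    return chroma.T @ chroma  # cosine similarity between all frame pairs

def naive_thumbnail(S, length=100):
    """Return the start frame of the window most similar to the rest of
    the piece, a toy proxy for the "most representative" section.
    Assumes the track has more than `length` frames."""
    scores = [S[i:i + length].sum() for i in range(S.shape[0] - length)]
    return int(np.argmax(scores))
```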
The “Music Browser,” developed by Sony CSL, IRCAM, UPF, Fraunhofer, and
others, as part of a European effort (Cuidado, Semantic Hi-Fi), is the “first
entirely automatic chain for extracting and exploiting musical metadata for
browsing music” [124]. It incorporates several techniques for music description
and data mining, and allows for a variety of queries based on editorial (i.e.,
entered manually by an editor) or acoustic metadata (i.e., derived from the
audio signal itself), as well as providing browsing tools and sharing capabilities among
users.
Although this thesis deals exclusively with the extraction and use of acoustic
metadata, music as a whole cannot be solely characterized by its “objective”
content. Music, as experienced by listeners, carries much “subjective”
value that evolves in time through communities. Cultural metadata attached
to music can be extracted online in a textual form through web crawling and
natural-language processing [125][170]. Only a combination of these different
types of metadata (i.e., acoustic, cultural, editorial) can lead to viable music
management and retrieval systems [11][123][169].