could appropriately be used as a phase-locking system for the beat tracker of
section 3.5, which currently runs in an open loop (i.e., without a feedback
control mechanism). This is left for future work.
5.3.1 Downbeat training
The downbeat prediction is supervised. The training is a semi-automatic task
that requires little human intervention. If our beat tracker is accurate throughout
the whole training song, and the measure length is constant, we label only one
beat with an integer value p_b ∈ [0, M − 1], where M is the number of beats
in the measure, and where 0 is the downbeat. The system extrapolates the
beat-phase labeling to the rest of the song. More generally, we can label the data
by tapping the downbeats along with the music in real time, and by recording
their location in a text file. The system finally labels segments with a phase
location: a float value p_s ∈ [0, M[. The resulting segment phase signal looks
like a sawtooth ranging from 0 to M. Taking the absolute value of its derivative
returns our ground-truth downbeat prediction signal, as displayed in the top
pane of Figure 5-5. Another good option consists of labeling tatums (section
3.4.3) rather than segments.
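As an illustration, a minimal NumPy sketch of this labeling step is given below.
The function name, the linear interpolation of phase between beats, and the
array conventions are assumptions made for illustration, not a description of
the original implementation.

    import numpy as np

    def downbeat_ground_truth(segment_times, beat_times, p_b, M):
        # segment_times, beat_times: NumPy arrays of onset times (seconds).
        # Integer phase of every tracked beat, extrapolated from one label
        # p_b in [0, M-1] (0 = downbeat); assumes a constant measure length.
        beat_phase = (p_b + np.arange(len(beat_times))) % M

        # Float phase p_s in [0, M[ for each segment: index of the preceding
        # beat plus the fractional position between the surrounding beats.
        idx = np.searchsorted(beat_times, segment_times, side="right") - 1
        idx = np.clip(idx, 0, len(beat_times) - 2)
        frac = (segment_times - beat_times[idx]) / \
               (beat_times[idx + 1] - beat_times[idx])
        p_s = (beat_phase[idx] + frac) % M    # sawtooth from 0 to M

        # The absolute derivative of the sawtooth spikes at each downbeat,
        # giving the ground-truth downbeat prediction signal.
        return p_s, np.abs(np.diff(p_s))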
The listening stage, including auditory spectrogram, segmentation, and music
feature labeling, is entirely unsupervised (Figure 3-19). So is the construction of
the time-lag feature vector, which is built by appending an arbitrary number of
preceding multidimensional feature vectors. Best results were obtained using 6
to 12 past segments, corresponding to nearly the length of a measure. We model
short-term memory fading by linearly scaling down older segments, thereby
increasing the relative weight of the most recent segments (Figure 5-5).
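A sketch of this time-lag construction follows; the exact shape of the linear
fading ramp and all names are assumed for illustration.

    import numpy as np

    def time_lag_vectors(features, n_past=9):
        # features: (n_segments, n_dims) array of per-segment feature vectors.
        # Each output vector stacks the n_past most recent segments, with
        # older segments linearly scaled down to model memory fading.
        n_seg, n_dims = features.shape
        weights = np.linspace(1.0 / n_past, 1.0, n_past)  # oldest ... newest
        vectors = []
        for i in range(n_past - 1, n_seg):
            window = features[i - n_past + 1 : i + 1]     # oldest ... newest
            vectors.append((window * weights[:, None]).ravel())
        return np.asarray(vectors)                # (n, n_past * n_dims)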
Training a support vector machine to predict the downbeat corresponds to a
regression task from several hundred feature dimensions (e.g., 9 past segments
× 42 features per segment = 378 features) into a single dimension (the
corresponding downbeat phase of the next segment). Several variations of this
principle are also possible. For instance, an additional PCA step (section 5.2.3)
allows us to reduce the space considerably while preserving most of its entropy.
We arbitrarily select the first 20 eigen-dimensions (Figure 5-6), which generally
account for about 60–80% of the total entropy while reducing the size of the
feature space by an order of magnitude. It was found that results are almost
equivalent, while the learning process gains in computation speed.
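As a sketch, such a PCA-plus-regression pipeline might be written as follows
with scikit-learn; the RBF kernel, the feature scaling step, and the random
stand-in data are illustrative assumptions, not details of our system.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    # Stand-in data: 9 past segments x 42 features = 378 dimensions,
    # regressed onto the downbeat phase of the next segment (here M = 4).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 378))
    y = rng.uniform(0, 4, size=500)

    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=20),    # keep the first 20 eigen-dimensions
        SVR(kernel="rbf"),
    )
    model.fit(X, y)
    phase_pred = model.predict(X[:5])   # predicted downbeat phase

Another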
approach that we have tested consists of selecting the relative features of a run-
ning self-similarity triangular matrix rather than the original absolute features,
e.g., ((9 past segments)² − 9)/2 = 36 features. Results were found to be roughly
equivalent, and also faster to compute.
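These relative features could be computed as sketched below; the use of the
Euclidean distance between segment feature vectors is our assumption, as are
all names.

    import numpy as np

    def relative_features(window):
        # window: (n_past, n_dims) NumPy array, e.g., the 9 most recent
        # segment feature vectors. Returns the (n_past**2 - n_past)/2
        # pairwise distances above the diagonal of the window's
        # self-similarity matrix, e.g., (81 - 9)/2 = 36 features.
        n = len(window)
        dists = np.linalg.norm(window[:, None] - window[None, :], axis=-1)
        return dists[np.triu_indices(n, k=1)]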
We expect that the resulting model is not only able to predict the downbeat
of our training data set, but to generalize well enough to predict the downbeat