Creating Music by Listening
by Tristan Jehan
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning, on June 17, 2005,
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
Abstract
Machines have the power and potential to make expressive music on their own. This thesis aims to computationally model the process of creating music using experience gained from listening to examples. Our unbiased, signal-based solution models the life cycle of listening, composing, and performing, turning the machine into an active musician rather than simply an instrument. We accomplish this through an analysis-synthesis technique based on combined perceptual and structural modeling of the musical surface, which leads to a minimal data representation. We introduce a music cognition framework that results from the interaction of psychoacoustically grounded causal listening, a time-lag embedded feature representation, and perceptual similarity clustering. Our bottom-up analysis aims to be generic and uniform, recursively revealing metrical hierarchies and structures of pitch, rhythm, and timbre. Training is proposed for unbiased top-down supervision, and is demonstrated with downbeat prediction. This musical intelligence enables a range of original manipulations, including song alignment, music restoration, cross-synthesis or song morphing, and ultimately the synthesis of original pieces.
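To make the pipeline sketched above more concrete, the following is a minimal illustrative sketch (not the thesis implementation) of two of the named components: a time-lag embedded feature representation and perceptual similarity clustering, here approximated with a plain k-means pass over hypothetical segment-level feature vectors. All names, dimensions, and parameters are assumptions chosen for illustration only.

```python
import numpy as np

def time_lag_embed(features, lags=4):
    """Stack each segment's feature vector with its `lags` predecessors into one row."""
    n, d = features.shape
    rows = [features[i - lags:i + 1].ravel() for i in range(lags, n)]
    return np.array(rows)

def kmeans(data, k=8, iters=50, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical input: 200 sound segments, each described by a 12-dimensional
# perceptual feature vector (e.g., loudness and timbre coefficients).
segments = np.random.default_rng(1).random((200, 12))
embedded = time_lag_embed(segments, lags=4)  # each row now spans 5 consecutive segments
labels, _ = kmeans(embedded, k=8)
print(labels[:20])  # cluster index assigned to each embedded segment
```

In this sketch, the time-lag embedding gives each point short-term temporal context before clustering, so segments are grouped by how they sound in sequence rather than in isolation.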
Table of Contents
1 Introduction
2 Background
2.1 Symbolic Algorithmic Composition
2.2 Hybrid MIDI-Audio Instruments
2.3 Audio Models
2.4 Music information retrieval
2.5 Framework
2.5.1 Music analysis/resynthesis
2.5.2 Description
2.5.3 Hierarchical description
2.5.4 Meaningful sound space
2.5.5 Personalized music synthesis
3 Music Listening
3.0.6 Anatomy
3.0.7 Psychoacoustics
3.1 Auditory Spectrogram
3.1.1 Spectral representation
3.1.2 Outer and middle ear
3.1.3 Frequency warping
3.1.4 Frequency masking
3.1.5 Temporal masking
3.1.6 Putting it all together
3.2 Loudness
3.3 Timbre
3.4 Onset Detection
3.4.1 Prior approaches
3.4.2 Perceptually grounded approach
3.4.3 Tatum grid
3.5 Beat and Tempo
3.5.1 Comparative models
3.5.2 Our approach
3.6 Pitch and Harmony
3.7 Perceptual feature space
4 Musical Structures
4.1 Multiple Similarities
4.2 Related Work
4.2.1 Hierarchical representations
4.2.2 Global timbre methods
4.2.3 Rhythmic similarities
4.2.4 Self-similarities
4.3 Dynamic Programming
4.4 Sound Segment Similarity
4.5 Beat Analysis
4.6 Pattern Recognition
4.6.1 Pattern length
4.6.2 Heuristic approach to downbeat detection
4.6.3 Pattern-synchronous similarities
4.7 Larger Sections
4.8 Chapter Conclusion
5 Learning Music Signals
5.1 Machine Learning
5.1.1 Supervised, unsupervised, and reinforcement learning
5.1.2 Generative vs. discriminative learning
5.2 Prediction
5.2.1 Regression and classification
5.2.2 State-space forecasting
5.2.3 Principal component analysis
5.2.4 Understanding musical structures
5.2.5 Learning and forecasting musical structures
5.2.6 Support Vector Machine
5.3 Downbeat prediction
5.3.1 Downbeat training
5.3.2 The James Brown case
5.3.3 Inter-song generalization
5.4 Time-Axis Redundancy Cancellation
5.4.1 Introduction
5.4.2 Nonhierarchical k-means clustering
5.4.3 Agglomerative Hierarchical Clustering
5.4.4 Compression
5.4.5 Discussion
6 Composing with sounds
6.1 Automated DJ
6.1.1 Beat-matching
6.1.2 Time-scaling
6.2 Early Synthesis Experiments
6.2.1 Scrambled Music
6.2.2 Reversed Music
6.3 Music Restoration
6.3.1 With previously known structure
6.3.2 With no prior knowledge
6.4 Music Textures
6.5 Music Cross-Synthesis
6.6 Putting it all together
7 Conclusion
7.1 Summary
7.2 Discussion
7.3 Contributions
7.3.1 Scientific contributions
7.3.2 Engineering contributions
7.3.3 Artistic contributions
7.4 Future directions
7.5 Final Remarks
Appendix A Skeleton
A.1 Machine Listening
A.2 Machine Learning
A.3 Music Synthesis
A.4 Software
A.5 Database