Deep Learning for Music


I have been working on classical music information retrieval problems for some time now. I am currently supervising students to build useful systems that require little or no target data.

Music Transcription with Convolutional Sequence-to-Sequence models (2017)

Karen Ullrich, Eelco van der Wel [PDF]
UNDER SUBMISSION for the International Conference on Music Information Retrieval Workshops (ISMIR) 2017, Suzhou, China.

Automatic Music Transcription (AMT) is a fundamental problem in Music Information Retrieval (MIR). The challenge is to translate an audio sequence to a symbolic representation of music. Recently, convolutional neural networks (CNNs) have been successfully applied to the task by translating frames of audio. However, by their nature those models cannot capture temporal relations and long-term dependencies. Furthermore, obtaining annotations for supervised learning in this setting is extremely labor-intensive. We propose a model that overcomes these problems. The convolutional sequence-to-sequence (Cseq2seq) model applies a CNN to learn a low-dimensional representation of audio frames and a sequential model to translate these learned features directly to a symbolic representation. Our approach has three advantages over other methods: (i) the audio frame representations and the sequential model are trained jointly end-to-end, (ii) the recurrent model can capture temporal features in musical pieces to improve transcription, and (iii) our model learns from entire sequences rather than from temporally accurate onset and offset annotations for each note, making it possible to train on large existing corpora of music. To test our method, we created our own dataset of 17K monophonic songs and their MusicXML files. Initial experiments support the validity of our approach.

Optical Music Recognition with Convolutional Sequence-to-Sequence models (2017)

Eelco van der Wel, Karen Ullrich [PDF]
Accepted paper at the International Conference on Music Information Retrieval (ISMIR) 2017, Suzhou, China.

Optical Music Recognition (OMR) is an important technology within Music Information Retrieval. Deep learning models show promising results on OMR tasks, but symbol-level annotated data sets of sufficient size to train such models are not available and are difficult to develop. We present a novel deep learning architecture, the Convolutional Sequence-to-Sequence model, both to move towards an end-to-end trainable OMR pipeline and to improve the learning process by training on full sentences of sheet music instead of individually labeled symbols. The model is trained and evaluated on a human-generated data set, with various image augmentations based on real-world scenarios. This data set is the first publicly available set in OMR research of sufficient size to train and evaluate deep learning models. With the introduced augmentations, a pitch recognition accuracy of 81% and a duration accuracy of 94% are achieved, resulting in a note-level accuracy of 80%.
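The relation between the three accuracies can be illustrated with a minimal sketch. It assumes, for illustration only, that predicted and reference notes are already aligned one-to-one and that a note counts as correct only when both its pitch and its duration match; the paper's exact evaluation procedure may differ. The function name `accuracies` and the toy notes are hypothetical.

```python
def accuracies(predicted, reference):
    """Return (pitch, duration, note) accuracy for two aligned note lists.

    Each note is a (pitch, duration) tuple. One-to-one alignment of the
    two lists is an assumption of this sketch, not a claim about the paper.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(reference)
    pairs = list(zip(predicted, reference))
    pitch_ok = sum(p[0] == r[0] for p, r in pairs)
    duration_ok = sum(p[1] == r[1] for p, r in pairs)
    # A note is correct only if pitch AND duration are both correct.
    note_ok = sum(p == r for p, r in pairs)
    return pitch_ok / n, duration_ok / n, note_ok / n


# Toy data (hypothetical): one duration error and one pitch error.
reference = [("C4", 1.0), ("D4", 0.5), ("E4", 0.5), ("F4", 2.0)]
predicted = [("C4", 1.0), ("D4", 1.0), ("G4", 0.5), ("F4", 2.0)]

pitch_acc, duration_acc, note_acc = accuracies(predicted, reference)
print(pitch_acc, duration_acc, note_acc)
```

Under this definition the note-level accuracy can never exceed the lower of the pitch and duration accuracies, which is consistent with the figures reported above.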

Structural Segmentation with Convolutional Neural Networks: MIREX Submission (2014)

Jan Schlüter, Karen Ullrich, Thomas Grill [BIBTEX]
Music Information Retrieval Evaluation eXchange Challenge.

Boundary Detection in Music Structure Analysis using Convolutional Neural Networks (2014)

Karen Ullrich, Jan Schlüter, Thomas Grill [PDF] [BIBTEX] [DEMO]
Accepted paper at the International Conference on Music Information Retrieval (ISMIR) 2014, Taipei, Taiwan.

The recognition of boundaries, e.g., between chorus and verse, is an important task in music structure analysis. The goal is to automatically detect such boundaries in audio signals so that the results are close to human annotation. In this work, we apply Convolutional Neural Networks to the task, trained directly on mel-scaled magnitude spectrograms. On a representative subset of the SALAMI structural annotation dataset, our method outperforms current techniques in terms of boundary retrieval F-measure at different temporal tolerances: we advance the state of the art from 0.33 to 0.46 for tolerances of ±0.5 seconds, and from 0.52 to 0.62 for tolerances of ±3 seconds. As the algorithm is trained on annotated audio data without the need for expert knowledge, we expect it to be easily adaptable to changed annotation guidelines and also to related tasks such as the detection of song transitions.
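The boundary retrieval F-measure at a given temporal tolerance can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper or in MIREX: it assumes boundaries are given as times in seconds and matches each estimated boundary to at most one annotated boundary with a simple greedy pass over the sorted lists. The function name and the toy boundary times are hypothetical.

```python
def boundary_f_measure(estimated, annotated, tolerance):
    """Return (precision, recall, F-measure) for boundary times in seconds.

    An estimated boundary is a hit if it lies within `tolerance` seconds of
    an annotated boundary that has not been matched yet (greedy one-to-one
    matching over sorted lists; an assumption of this sketch).
    """
    est = sorted(estimated)
    ref = sorted(annotated)
    hits = 0
    i = j = 0
    while i < len(est) and j < len(ref):
        diff = est[i] - ref[j]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # estimate is too early for this and all later annotations
        else:
            j += 1  # annotation is too early for this and all later estimates
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example (hypothetical boundary times, in seconds).
annotated = [10.0, 30.0, 55.0, 80.0]
estimated = [10.3, 32.0, 54.0, 70.0, 81.0]

for tol in (0.5, 3.0):
    p, r, f = boundary_f_measure(estimated, annotated, tol)
    print(f"tolerance ±{tol}s: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

On this toy input the same set of estimates scores much higher at ±3 seconds than at ±0.5 seconds, which is why results are reported separately for each tolerance.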

Feed-Forward Neural Networks for Boundary Detection in Music Structure Analysis (2014)

Master's thesis, University of Amsterdam, Amsterdam, The Netherlands. [PDF] [BIBTEX]

Following pioneering studies that first applied neural networks in the field of music information retrieval (MIR), we apply feed-forward neural networks to retrieve boundaries in musical pieces, e.g., between chorus and verse. Detecting such segment boundaries is an important task in music structure analysis, a sub-domain of MIR. To that end, we developed a framework to perform supervised learning on a representative subset of the SALAMI data set, which contains structural annotations. More specifically, we apply convolutional networks to learn spatial relationships, followed by fully connected layers to detect segment boundaries automatically. The data was presented to the networks as mel-scaled magnitude spectrograms. Furthermore, we applied the dropout technique. After optimising our models with respect to various hyper-parameters, we find that they achieve a higher F-score than any algorithm in the MIREX campaigns of 2012 and 2013. In particular, we achieved F-measures of 0.476 for tolerances of ±0.5s and 0.619 for tolerances of ±3s. These are improvements over current techniques of 0.14 and 0.09, respectively. Our method stands out because it is mainly data-driven and does not rely on hand-crafted high-level features to create classifiers. When investigating the method further, we find that even with a feature as simple and general-purpose as chroma vectors, and with no convolutional layers, we can still achieve results comparable to existing algorithms. Moreover, we visualised which regions of the input are of highest interest to our networks, and found that all networks concentrate on very similar time and frequency bands.
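For readers unfamiliar with mel scaling, the sketch below shows one common (HTK-style) definition of the mel scale and how band centres spaced evenly in mel become increasingly wide apart in Hz. The thesis abstract does not state which mel variant, how many bands, or which frequency range was used, so the formula choice, the function names and the parameters here are assumptions for illustration only.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert Hz to mel using the HTK-style formula (one common variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centres(f_min, f_max, n_bands):
    """Return n_bands centre frequencies (Hz), equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_bands)]


# Hypothetical settings: 8 bands between 0 Hz and 8 kHz.
centres = mel_band_centres(0.0, 8000.0, 8)
for c in centres:
    print(f"{c:8.1f} Hz")
```

The printed centres are close together at low frequencies and further apart at high frequencies, so a mel-scaled spectrogram gives low frequencies finer resolution than a linearly spaced one.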