I have been working on classical music information retrieval problems for some time now. I am currently supervising students to build useful systems that require little or no target data.
Automatic Music Transcription (AMT) is a fundamental problem in Music Information Retrieval (MIR). The challenge is to translate an audio sequence to a symbolic representation of music.
Recently, convolutional neural networks (CNNs) have been successfully applied to the task by translating individual frames of audio. However, such frame-wise models cannot, by their nature, capture temporal relations and long-term dependencies. Furthermore, obtaining annotations for supervised learning in this setting is extremely labor-intensive.
We propose a model that addresses these problems. The convolutional sequence-to-sequence (Cseq2seq) model applies a CNN to learn a low-dimensional representation of audio frames and a sequential model to translate these learned features directly into a symbolic representation.
Our approach has three advantages over other methods: (i) the audio frame representations and the sequential model are trained jointly end-to-end, (ii) the recurrent model can capture temporal features in musical pieces to improve transcription, and (iii) our model learns from entire sequences rather than from temporally precise onset and offset annotations for each note, making it possible to train on large existing corpora of music.
To test our method, we created our own dataset of 17K monophonic songs and their respective MusicXML files. Initial experiments support the validity of our approach.
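To illustrate advantage (iii), the following minimal sketch shows how a sequence-level training target might be derived from a MusicXML file without any onset or offset timing. The token format (`pitch_type` strings with `<sos>` and `<eos>` markers), the function name `musicxml_to_tokens`, and the hand-written example fragment are assumptions made for illustration, not necessarily the representation used in our experiments.

```python
import xml.etree.ElementTree as ET

# Hand-written MusicXML fragment (hypothetical example, not taken from the dataset).
MUSICXML = """<score-partwise version="3.1">
  <part-list><score-part id="P1"><part-name>Melody</part-name></score-part></part-list>
  <part id="P1">
    <measure number="1">
      <attributes><divisions>1</divisions></attributes>
      <note><pitch><step>C</step><octave>4</octave></pitch><duration>1</duration><type>quarter</type></note>
      <note><pitch><step>F</step><alter>1</alter><octave>4</octave></pitch><duration>2</duration><type>half</type></note>
      <note><rest/><duration>1</duration><type>quarter</type></note>
    </measure>
  </part>
</score-partwise>"""


def musicxml_to_tokens(xml_text):
    """Flatten a monophonic MusicXML score into a list of 'pitch_type' tokens.

    Assumed token scheme: one token per note or rest, in score order,
    with no timing information beyond the notated note type.
    """
    root = ET.fromstring(xml_text)
    tokens = ["<sos>"]
    for note in root.iter("note"):
        note_type = note.findtext("type", default="unknown")
        pitch = note.find("pitch")
        if pitch is None:
            # A <note> without <pitch> is treated as a rest in this sketch.
            name = "rest"
        else:
            step = pitch.findtext("step")
            octave = pitch.findtext("octave")
            alter = int(float(pitch.findtext("alter", default="0")))
            accidental = "#" * alter if alter > 0 else "b" * (-alter)
            name = f"{step}{accidental}{octave}"
        tokens.append(f"{name}_{note_type}")
    tokens.append("<eos>")
    return tokens


tokens = musicxml_to_tokens(MUSICXML)
print(tokens)
# ['<sos>', 'C4_quarter', 'F#4_half', 'rest_quarter', '<eos>']
```

A target of this kind only states which notes occur and in what order, so any recording paired with its score can serve as a training example without frame-level alignment.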
Optical Music Recognition (OMR) is an important technology within Music Information Retrieval. Deep learning models show promising results on OMR tasks, but symbol-level annotated data sets of sufficient size to train such models are not available and are difficult to develop. We present a novel deep learning architecture called a Convolutional Sequence-to-Sequence model, both to move towards an end-to-end trainable OMR pipeline and to improve the learning process by training on full sentences of sheet music instead of individually labeled symbols. The model is trained and evaluated on a human-generated data set, with various image augmentations based on real-world scenarios. This data set is the first publicly available set in OMR research with sufficient size to train and evaluate deep learning models. With the introduced augmentations, a pitch recognition accuracy of 81% and a duration accuracy of 94% are achieved, resulting in a note level accuracy
The recognition of boundaries, e.g., between chorus and
verse, is an important task in music structure analysis. The
goal is to automatically detect such boundaries in audio
signals so that the results are close to human annotation.
In this work, we apply Convolutional Neural Networks to
the task, trained directly on mel-scaled magnitude spectrograms.
On a representative subset of the SALAMI structural
annotation dataset, our method outperforms current
techniques in terms of boundary retrieval F-measure at different
temporal tolerances: We advance the state-of-the-art
from 0.33 to 0.46 for tolerances of ±0.5 seconds, and from
0.52 to 0.62 for tolerances of ±3 seconds. As the algorithm
is trained on annotated audio data without the need
for expert knowledge, we expect it to be easily adaptable
to changed annotation guidelines and also to related tasks
such as the detection of song transitions.
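As a rough illustration of the evaluation measure, the following sketch computes a boundary retrieval F-measure at a given temporal tolerance. It assumes a simple greedy one-to-one matching between sorted boundary lists; the function name `boundary_f_measure` and the example boundary times are hypothetical, and established evaluation toolkits may match boundaries differently or trim the first and last boundary.

```python
def boundary_f_measure(reference, estimated, tolerance):
    """Hit-rate F-measure for segment boundaries (times in seconds).

    An estimated boundary counts as a hit if it lies within `tolerance`
    seconds of a reference boundary that has not been matched yet.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    # Two-pointer greedy matching: each boundary is used at most once.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate too early to match any remaining reference
        else:
            i += 1  # reference too early to match any remaining estimate
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical boundary times in seconds, for illustration only.
reference = [10.0, 32.5, 61.0, 90.0]
estimated = [10.3, 30.0, 61.4, 75.0, 92.0]

f_narrow = boundary_f_measure(reference, estimated, 0.5)
f_wide = boundary_f_measure(reference, estimated, 3.0)
print(round(f_narrow, 3), round(f_wide, 3))
# 0.444 0.889
```

The same predictions score much higher at ±3 seconds than at ±0.5 seconds, which is why results are reported separately for each tolerance.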
Following pioneering studies that first applied neural networks in the field of music information retrieval (MIR),
we apply feed-forward neural networks to retrieve boundaries in musical pieces, e.g., between chorus and verse.
Detecting such segment boundaries is an important task in music structure analysis, a sub-domain of MIR. To
that end, we developed a framework to perform supervised learning on a representative subset of the SALAMI
data set, containing structural annotations. More specifically, we apply convolutional networks to learn spatial
relationships and fully connected layers to detect segment boundaries automatically. In that context, the data
was presented to the networks as mel-scaled magnitude spectrograms. Furthermore, we applied the dropout
technique. After optimising our models with respect to various hyper-parameters, we find them to outperform
the F-score of any algorithm in the MIREX campaign of 2012 and 2013. In particular, we achieved F-measures
of 0.476 for tolerances of ±0.5 s and 0.619 for tolerances of ±3 s. These are improvements over current techniques
of 0.14 and 0.09, respectively. Our method is notable because it is mainly data-driven and does not utilize
hand-crafted high-level features to create classifiers. When investigating the method further, we find that even
with such a simple general-purpose feature as chroma vectors and no convolutional layers we can still achieve
results comparable to existing algorithms. Moreover, we visualised which regions of the input are of highest
interest for our networks. As a result, we found all networks to concentrate on very similar time and frequency