I have been working on classical music information retrieval problems for some time now. I am currently supervising students to build useful systems that require little or no target data.
Automatic Music Transcription (AMT) is a fundamental problem in Music Information Retrieval (MIR). The challenge is to translate an audio sequence to a symbolic representation of music.
Recently, convolutional neural networks (CNNs) have been successfully applied to the task by translating individual frames of audio. However, by their nature such models cannot capture temporal relations and long-term dependencies. Furthermore, obtaining annotations for supervised learning in this setting is extremely labor-intensive. We propose a model that overcomes all of these problems. The convolutional sequence-to-sequence (Cseq2seq) model applies a CNN to learn a low-dimensional representation of audio frames and a sequential model to translate these learned features directly into a symbolic representation. Our approach has three advantages over other methods: (i) the audio frame representations and the sequential model are trained jointly, end-to-end; (ii) the recurrent model can capture temporal features in musical pieces to improve transcription; and (iii) our model learns from entire sequences rather than from precisely annotated onsets and offsets for each note, making it possible to train on large, already existing corpora of music. To test our method, we created our own dataset of 17K monophonic songs with corresponding MusicXML files. Initial experiments prove the validity of our approach.
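As a rough illustration of the idea (not the published implementation), the sketch below wires up such a convolutional sequence-to-sequence model in PyTorch: a small CNN embeds each spectrogram frame, a GRU encoder summarises the frame sequence, and a GRU decoder emits one symbolic token per step. The class name Cseq2seq, all layer sizes, the number of frequency bins, and the token vocabulary are illustrative assumptions.

# Minimal sketch (assumed sizes, not the authors' code) of a convolutional
# sequence-to-sequence model for monophonic AMT.
import torch
import torch.nn as nn

class Cseq2seq(nn.Module):
    def __init__(self, n_bins=229, frame_dim=128, hidden=256, vocab_size=90):
        super().__init__()
        # CNN over single frames, treated as 1-D signals over frequency bins
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.project = nn.Linear(64, frame_dim)
        self.encoder = nn.GRU(frame_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, frame_dim)
        self.decoder = nn.GRU(frame_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, target_tokens):
        # frames: (batch, time, n_bins); target_tokens: (batch, out_len)
        b, t, f = frames.shape
        x = self.frame_encoder(frames.reshape(b * t, 1, f)).squeeze(-1)
        x = self.project(x).reshape(b, t, -1)   # learned frame representations
        _, h = self.encoder(x)                  # summary of the audio sequence
        y, _ = self.decoder(self.embed(target_tokens), h)
        return self.out(y)                      # symbol logits per output step

model = Cseq2seq()
logits = model(torch.randn(2, 100, 229), torch.randint(0, 90, (2, 20)))

In such a setup the decoder would typically be trained with teacher forcing on token sequences derived from the MusicXML files.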
Optical Music Recognition (OMR) is an important technology within Music Information Retrieval. Deep learning models show promising results on OMR tasks, but
symbol-level annotated data sets of sufficient size to train such models are not available and difficult to develop. We present a novel deep learning architecture called a Convolutional Sequence-to-Sequence model to both move towards an end-to-end trainable OMR pipeline and improve the learning process by training on full sentences of sheet music instead of individually labeled symbols. The model is trained and evaluated on a human-generated data set, with various image augmentations based on real-world scenarios. This data set is the first publicly available set in OMR research with sufficient size to train and evaluate deep learning models. With the introduced augmentations, a pitch recognition accuracy of 81% and a duration accuracy of 94% are achieved, resulting in a note-level accuracy of 80%.
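The following sketch shows one plausible way to organise such a model, under my own assumptions rather than the paper's specification: the staff image is cut into fixed-width vertical slices, a 2-D CNN embeds each slice, and a GRU encoder/decoder predicts a pitch class and a duration class per note, mirroring the separately reported pitch and duration accuracies. The class name OMRSeq2Seq, the slice dimensions, and the vocabulary sizes are hypothetical.

# Sketch (assumptions, not the published implementation) of a convolutional
# sequence-to-sequence OMR model over vertical slices of a staff image.
import torch
import torch.nn as nn

class OMRSeq2Seq(nn.Module):
    def __init__(self, slice_h=128, slice_w=16, emb=128, hidden=256,
                 n_pitches=60, n_durations=16):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-slice encoder
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * (slice_h // 4) * (slice_w // 4), emb),
        )
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.note_embed = nn.Embedding(n_pitches * n_durations, emb)
        self.pitch_head = nn.Linear(hidden, n_pitches)       # pitch logits
        self.duration_head = nn.Linear(hidden, n_durations)  # duration logits

    def forward(self, slices, prev_notes):
        # slices: (batch, n_slices, 1, H, W); prev_notes: (batch, n_notes)
        b, s = slices.shape[:2]
        z = self.cnn(slices.flatten(0, 1)).reshape(b, s, -1)
        _, h = self.encoder(z)
        y, _ = self.decoder(self.note_embed(prev_notes), h)
        return self.pitch_head(y), self.duration_head(y)

model = OMRSeq2Seq()
pitch_logits, dur_logits = model(torch.randn(2, 40, 1, 128, 16),
                                 torch.randint(0, 60 * 16, (2, 30)))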
The recognition of boundaries, e.g., between chorus and
verse, is an important task in music structure analysis. The goal is to automatically detect such boundaries in audio signals so that the results are close to human annotation. In this work, we apply Convolutional Neural Networks to the task, trained directly on mel-scaled magnitude spectrograms. On a representative subset of the SALAMI structural annotation dataset, our method outperforms current techniques in terms of boundary retrieval F-measure at different temporal tolerances: we advance the state of the art from 0.33 to 0.46 for tolerances of ±0.5 seconds, and from 0.52 to 0.62 for tolerances of ±3 seconds. As the algorithm is trained on annotated audio data without the need for expert knowledge, we expect it to be easily adaptable to changed annotation guidelines and also to related tasks such as the detection of song transitions.
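A minimal sketch of this setup, with illustrative layer sizes rather than the exact published architecture: a CNN receives a mel-spectrogram excerpt centred on a frame and outputs the probability that a segment boundary falls at that centre; applying it frame by frame yields a boundary-probability curve from which boundaries can be read off by peak picking.

# Sketch (assumed dimensions) of a boundary-detection CNN over
# mel-spectrogram excerpts.
import torch
import torch.nn as nn

class BoundaryCNN(nn.Module):
    def __init__(self, n_mels=80, context=116):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(8, 6)), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 6)),
            nn.Conv2d(16, 32, kernel_size=(6, 3)), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                      # infer the flattened size
            n = self.features(torch.zeros(1, 1, n_mels, context)).shape[1]
        self.head = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(n, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),       # P(boundary at centre frame)
        )

    def forward(self, excerpt):                    # (batch, 1, n_mels, context)
        return self.head(self.features(excerpt))

model = BoundaryCNN()
p = model(torch.randn(4, 1, 80, 116))              # boundary probabilities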
Following pioneering studies that first applied neural networks in the field of music information retrieval (MIR),
we apply feed-forward neural networks to retrieve boundaries in musical pieces, e.g., between chorus and verse. Detecting such segment boundaries is an important task in music structure analysis, a sub-domain of MIR. To that end, we developed a framework to perform supervised learning on a representative subset of the SALAMI data set, which contains structural annotations. More specifically, we apply convolutional layers to learn spatial relationships, followed by fully connected layers, to detect segment boundaries automatically. In that context, the data was presented to the networks as mel-scaled magnitude spectrograms. Furthermore, we applied the dropout technique. After optimising our models with respect to various hyper-parameters, we find them to outperform the F-score of any algorithm in the MIREX campaigns of 2012 and 2013. In particular, we achieved F-measures of 0.476 for tolerances of ±0.5s and 0.619 for tolerances of ±3s, improvements of 0.14 and 0.09 over current techniques. Our method stands out because it is mainly data-driven and does not rely on hand-crafted high-level features to build classifiers. When investigating the method further, we find that even with a feature as simple and general-purpose as chroma vectors, and no convolutional layers, we can still achieve results comparable to existing algorithms. Moreover, we visualised which regions of the input are of highest interest to our networks and found all networks to concentrate on very similar time and frequency bands.
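For reference, this is how a boundary retrieval F-measure with a temporal tolerance can be computed. The greedy matching below is a simplified sketch (standard evaluation tools use an optimal one-to-one matching), and the boundary times in the usage example are made up.

# Simplified sketch of a tolerance-based boundary retrieval F-measure.
def boundary_f_measure(predicted, annotated, tolerance=0.5):
    predicted, annotated = sorted(predicted), sorted(annotated)
    matched = set()
    hits = 0
    for p in predicted:
        # a prediction is a hit if it lies within +/- tolerance seconds
        # of a not-yet-matched annotated boundary
        for i, a in enumerate(annotated):
            if i not in matched and abs(p - a) <= tolerance:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(annotated) if annotated else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Evaluate the same (made-up) predictions at both tolerances used above
print(boundary_f_measure([10.2, 45.6, 80.1], [10.0, 46.0, 81.0], tolerance=0.5))
print(boundary_f_measure([10.2, 45.6, 80.1], [10.0, 46.0, 81.0], tolerance=3.0))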