Wednesday 2 November 2011

ISMIR 2011: lessons from speech and image processing

Cool ideas from speech and image processing tend to make their way gradually into music-related research, and a few of them were on display at ISMIR.

The Best Student Paper prize went to Mikael Henaff from NYU's Courant Institute for his research applying Predictive Sparse Decomposition to audio spectrograms to learn sparse features for classification. His approach uses a couple of techniques that were new to me: first sharpening the spectrogram, then learning the basis functions efficiently by gradient descent. Not only do the resulting features give impressive classification accuracy even with a relatively simple classifier, but - as you can see from the figure on the right - many of them appear to learn musical intervals and chords directly from the spectrogram.
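For anyone who, like me, hadn't met Predictive Sparse Decomposition before, here is a rough Python sketch of the core idea: learn a dictionary by sparse coding, while simultaneously training a fast encoder to predict the sparse codes. The dimensions, learning rates and the plain linear encoder are illustrative assumptions on my part, not the exact setup from the paper.

import numpy as np

rng = np.random.default_rng(0)
n_input, n_codes = 96, 128          # e.g. spectrogram bins -> dictionary size
lam, alpha, lr = 0.5, 1.0, 0.01     # sparsity weight, prediction weight, step size

D = rng.standard_normal((n_input, n_codes)) * 0.1   # decoder (dictionary of basis functions)
W = rng.standard_normal((n_codes, n_input)) * 0.1   # encoder weights (here simply linear)

def infer_code(x, n_steps=50, step=0.01):
    """Sparse code z minimising ||x - Dz||^2 + lam*|z|_1 + alpha*||z - Wx||^2,
    found by proximal gradient descent (ISTA-style)."""
    z = W @ x
    for _ in range(n_steps):
        grad = -2 * D.T @ (x - D @ z) + 2 * alpha * (z - W @ x)
        z = z - step * grad
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return z

def train_step(x):
    """One PSD update: infer z, then nudge decoder and encoder by gradient descent."""
    global D, W
    z = infer_code(x)
    D += lr * 2 * np.outer(x - D @ z, z)              # reduce reconstruction error
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # keep dictionary atoms unit norm
    W += lr * 2 * alpha * np.outer(z - W @ x, x)      # train encoder to predict z

The payoff is at test time: once trained, features come straight from the encoder (z ~= W @ x) in a single matrix multiply, with no iterative optimisation per example, which is what makes the features cheap enough to feed a classifier at scale.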

The music similarity system submitted to MIREX by Geoffroy Peeters' team from IRCAM models each track as a Gaussian Mixture Model supervector. The representation is explained in a paper they presented at DAFx a few weeks earlier, and borrows recent methods from speaker recognition. You start by training a Universal Background Model - itself a large GMM, fitted to frames pooled from many tracks - which represents the sound of "music in general". Any particular track can then be represented as a vector of differences from the UBM, learned by a simple iterative adaptation process. The resulting "supervectors" can be compared to one another efficiently with plain Euclidean distance.
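Here is a sketch of that pipeline in Python, as I understand it from the speaker-recognition literature: fit the UBM, MAP-adapt its means to each track, and stack the adapted means into a supervector. The UBM size, relevance factor and normalisation are my own illustrative choices, not necessarily IRCAM's exact parameters.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(pooled_frames, n_components=64):
    """Universal Background Model: a GMM fit on frames pooled from many tracks."""
    return GaussianMixture(n_components=n_components, covariance_type='diag',
                           random_state=0).fit(pooled_frames)

def supervector(ubm, track_frames, relevance=16.0):
    """MAP-adapt the UBM means to one track, then stack them into a supervector."""
    resp = ubm.predict_proba(track_frames)            # (frames, components)
    n_k = resp.sum(axis=0)                            # soft frame counts per component
    # Weighted mean of this track's frames under each component
    ex_k = (resp.T @ track_frames) / np.maximum(n_k[:, None], 1e-8)
    alpha = (n_k / (n_k + relevance))[:, None]        # adaptation strength per component
    means = alpha * ex_k + (1 - alpha) * ubm.means_   # shift means towards the track
    # Scale by UBM weights and covariances so that Euclidean distance between
    # supervectors approximates a divergence between the adapted GMMs
    norm = np.sqrt(ubm.weights_[:, None]) / np.sqrt(ubm.covariances_)
    return (norm * (means - ubm.means_)).ravel()

# Track-to-track similarity is then just Euclidean distance:
# dist = np.linalg.norm(supervector(ubm, frames_a) - supervector(ubm, frames_b))

That last point is what makes the scheme attractive for similarity search: the expensive GMM machinery runs once per track, after which every track is just a point in a vector space.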
