Wednesday 2 November 2011

ISMIR 2011: lessons from speech and image processing

Cool ideas from speech and image processing tend to make their way gradually into music-related research, and a few of them were on display at ISMIR.

The Best Student Paper prize went to Mikael Henaff from NYU's Courant Institute for his research applying Predictive Sparse Decomposition to audio spectrograms, to learn sparse features for classification. This uses some nice techniques that were new to me, firstly to sharpen the spectrogram, and then to learn basis functions efficiently by gradient descent. Not only do the resulting features give impressive classification accuracy even with a relatively simple classifier, but - as you can see from the figure on the right - many of them appear to learn musical intervals and chords directly from the spectrogram.
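To make "sparse features" a bit more concrete, here is a minimal sketch of sparse coding against a fixed dictionary, using iterative soft-thresholding (ISTA). This is not the Predictive Sparse Decomposition method from the paper, which also trains a fast encoder and learns the basis functions themselves; it only illustrates how an L1 penalty pushes small coefficients to exactly zero. The function names, the toy dictionary and the values of `lam` and `step` are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clamping small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista(x, dictionary, lam, step, n_iter=100):
    """Approximately minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 over z.

    dictionary is a list of rows (n_dims x n_atoms); each column is one
    basis function. step should be at most 1 / (largest eigenvalue of D^T D).
    """
    n_dims, n_atoms = len(dictionary), len(dictionary[0])
    z = [0.0] * n_atoms
    for _ in range(n_iter):
        # Reconstruction error for the current code: D z - x
        residual = [
            sum(dictionary[i][k] * z[k] for k in range(n_atoms)) - x[i]
            for i in range(n_dims)
        ]
        # Gradient of the squared-error term with respect to z: D^T (D z - x)
        grad = [
            sum(dictionary[i][k] * residual[i] for i in range(n_dims))
            for k in range(n_atoms)
        ]
        # Gradient step followed by shrinkage, which produces the sparsity
        z = [
            soft_threshold(z[k] - step * grad[k], step * lam)
            for k in range(n_atoms)
        ]
    return z


# Toy example: a 2-D "spectrogram frame" and a trivial orthonormal dictionary.
identity = [[1.0, 0.0], [0.0, 1.0]]
code = ista([3.0, 0.1], identity, lam=0.5, step=1.0)
```

In the toy example the weak second coefficient is zeroed out entirely, while the strong first one is kept (slightly shrunk).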

The music similarity system submitted to MIREX by Geoffroy Peeters' team from IRCAM models each track as a Gaussian Mixture Model supervector. This representation is explained in a paper from the DAFX conference a few weeks back, and borrows from some recent methods in speaker recognition. You start by training a Universal Background Model, which represents the sound of "music in general". Any particular track can then be represented as a vector of differences from the UBM, which are learned by a simple iterative process. The resulting "supervectors" can then be efficiently compared to one another with Euclidean distance.
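The idea can be sketched in a few lines. This is a rough illustration rather than the IRCAM system: it assumes one-dimensional features, a UBM whose parameters are already given instead of trained, adaptation of the component means only, and a relevance factor of 16. All names and toy values here are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def adapt_means(frames, weights, means, variances, relevance=16.0, n_iter=5):
    """Iteratively pull the UBM means towards one track's frames (MAP-style)."""
    adapted = list(means)
    n_comp = len(means)
    for _ in range(n_iter):
        counts = [0.0] * n_comp
        sums = [0.0] * n_comp
        for x in frames:
            likes = [
                w * gaussian_pdf(x, m, v)
                for w, m, v in zip(weights, adapted, variances)
            ]
            total = sum(likes)
            if total == 0.0:
                continue  # frame is too far from every component to assign
            for k in range(n_comp):
                posterior = likes[k] / total
                counts[k] += posterior
                sums[k] += posterior * x
        # Blend the track statistics with the UBM means, which act as a prior
        adapted = [
            (sums[k] + relevance * means[k]) / (counts[k] + relevance)
            for k in range(n_comp)
        ]
    return adapted


def supervector(frames, weights, means, variances):
    """Represent a track as its vector of mean offsets from the UBM."""
    adapted = adapt_means(frames, weights, means, variances)
    return [a - m for a, m in zip(adapted, means)]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy UBM with two components, and three toy "tracks" of feature frames.
ubm = ([0.5, 0.5], [0.0, 10.0], [1.0, 1.0])
track_a = supervector([1.0] * 20 + [10.0] * 20, *ubm)
track_b = supervector([0.9] * 20 + [10.0] * 20, *ubm)
track_c = supervector([-1.0] * 20 + [10.0] * 20, *ubm)
```

With these toy frames, `euclidean(track_a, track_b)` comes out smaller than `euclidean(track_a, track_c)`, since tracks A and B shift the first component in the same direction.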

ISMIR 2011: music and lyrics

There were only a few papers about lyrics at ISMIR this year, but some of them were really interesting.

Matt McVicar tested a rather profound hypothesis about the relationship of music and lyrics: that the common determining factor for music and words is the mood that the songwriter and composer hope to create together. Matt showed how principal axes apparently representing the valence (sad-happy) and arousal (calm-excited) scales commonly used to describe emotions fall directly out of a Canonical Correlation Analysis applied to lyrics and audio features for 120,000 songs. As Matt admitted in his presentation, the principal axes found by CCA in the audio space are very highly correlated with energy and loudness, so what's been found so far may simply show the association of happy, active lyrics with loud music and vice versa. But this is an elegant line of research that I'd enjoy seeing pursued in more depth.

Xiao Hu from the University of Denver looked just at the lyrics of 2,700 songs from the MIREX Mood Tag Dataset, which have mood annotations based on Last.fm tags. Her investigations concern the relationship of mood and creativity, defined in terms of some simple textual features mainly concerning the size and diversity of vocabulary used in a song. In a nutshell it seems that there are more ways to feel sad than there are to feel happy. But maybe you knew that already.
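For a flavour of what "simple textual features" might look like, here is a small sketch that measures the size and diversity of a song's vocabulary. The particular features (token count, distinct word count and type-token ratio) and the crude tokeniser are my own assumptions, not necessarily the ones used in the paper.

```python
import re


def lexical_features(lyrics):
    """Return vocabulary size and diversity measures for one song's lyrics."""
    # Crude tokeniser: lowercase runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    vocabulary = set(tokens)
    n_tokens = len(tokens)
    return {
        "n_tokens": n_tokens,
        "n_types": len(vocabulary),
        # Share of distinct words: 1.0 means no word is ever repeated
        "type_token_ratio": len(vocabulary) / n_tokens if n_tokens else 0.0,
    }


repetitive = lexical_features("la la la la")
varied = lexical_features("I walk the line")
```

A heavily repetitive chorus scores a low type-token ratio, while a line with no repeated words scores 1.0. Note that this ratio tends to fall as texts get longer, so comparing songs of very different lengths would need some care.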

Another take-home message for me was that even simple textual features may turn out to be pretty useful for some music classification and prediction tasks. Rudolf Mayer and Andreas Rauber of TUW report some fancy methods of combining features for genre classification. They see a hefty increase in accuracy when using statistical features to summarise the style of lyrics in addition to audio features, presumably because musical genres have their own characteristic lyrical styles and poetic forms too.

Tuesday 1 November 2011

ISMIR 2011: data, data and more data

This year's ISMIR conference saw some powerful critiques of current evaluation datasets for MIR tasks, but also some great new releases of data that should help us start to do a better job.

First the criticisms. Julian Urbano pointed out a widening gap in sophistication between the annual MIREX algorithm bake-off and more established equivalents such as TREC for document and image search. For anyone not completely persuaded by his arguments, Fabien Gouyon and his team demonstrated convincingly that current music autotagging algorithms fail to generalise from one dataset to another. In other words, at the moment there is no hard evidence that they are really learning anything at all. The main cause is probably that the available reference datasets are simply too small.

And now the data:

Wow, that's a lot of new data. Time to get down to some algorithm development!