MIR in action: 2011

Wednesday, 2 November 2011

ISMIR 2011: lessons from speech and image processing

Cool ideas from speech and image processing tend to make their way gradually into music-related research, and a few of them were on display at ISMIR.

The Best Student Paper prize went to Mikael Henaff from NYU's Courant Institute for his research applying Predictive Sparse Decomposition to audio spectrograms, to learn sparse features for classification. This uses some nice techniques that were new to me, firstly to sharpen the spectrogram, and then to learn basis functions efficiently by gradient descent. Not only do the resulting features give impressive classification accuracy even with a relatively simple classifier, but, as you can see from the figure on the right, many of them appear to learn musical intervals and chords directly from the spectrogram.
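For readers who haven't met Predictive Sparse Decomposition before, the objective usually given in the PSD literature looks roughly like the following. This is a sketch of the general formulation rather than the exact setup of the prize-winning paper, and the symbols are my own assumptions: x is a (sharpened) spectrogram frame, D the dictionary of basis functions, z the sparse code, F a trainable encoder with parameters W whose exact form I leave unspecified, and λ and α are weighting constants.

```latex
\mathcal{L}(x, z; D, W) =
    \lVert x - D z \rVert_2^2
    + \lambda \lVert z \rVert_1
    + \alpha \lVert z - F(x; W) \rVert_2^2
```

The first term asks the basis functions to reconstruct the input, the second keeps the code sparse, and the third trains the encoder to predict the sparse code directly, which is what makes feature extraction cheap once training is done. All three terms can be minimised by gradient descent on D and W.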

The music similarity system submitted to MIREX by Geoffroy Peeters' team from IRCAM models each track as a Gaussian Mixture Model supervector. This representation is explained in a paper from the DAFX conference from a few weeks back, and borrows from some recent methods in voice recognition. You start by training a Universal Background Model, which represents the sound of "music in general". Any particular track can then be represented as a vector of differences from the UBM, which are learned by a simple iterative process. These resulting "supervectors" can then be efficiently compared to one another with Euclidean distance.
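To make the supervector idea concrete, here is a minimal pure-Python sketch. It assumes the mean-only MAP adaptation that is common in the speaker recognition literature, not necessarily what the IRCAM system does. The function names, the toy two-component UBM, the relevance factor of 16 and the scaling by weight and variance are all my own illustrative choices.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def posteriors(x, weights, means, variances):
    """Probability that each mixture component generated frame x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_means(frames, weights, ubm_means, variances, relevance=16.0, n_iter=5):
    """MAP-adapt the UBM means to one track; weights and variances stay fixed."""
    dim = len(ubm_means[0])
    means = [list(m) for m in ubm_means]
    for _ in range(n_iter):
        counts = [0.0] * len(weights)
        sums = [[0.0] * dim for _ in weights]
        for x in frames:
            for k, g in enumerate(posteriors(x, weights, means, variances)):
                counts[k] += g
                for d in range(dim):
                    sums[k][d] += g * x[d]
        for k in range(len(weights)):
            # alpha grows with the amount of data assigned to component k
            alpha = counts[k] / (counts[k] + relevance)
            for d in range(dim):
                if counts[k] > 0.0:
                    data_mean = sums[k][d] / counts[k]
                else:
                    data_mean = ubm_means[k][d]
                means[k][d] = alpha * data_mean + (1.0 - alpha) * ubm_means[k][d]
    return means


def supervector(frames, weights, ubm_means, variances):
    """Stack the scaled differences between adapted means and UBM means."""
    adapted = adapt_means(frames, weights, ubm_means, variances)
    return [
        math.sqrt(w) * (a - u) / math.sqrt(v)
        for w, adapted_k, ubm_k, var_k in zip(weights, adapted, ubm_means, variances)
        for a, u, v in zip(adapted_k, ubm_k, var_k)
    ]


def euclidean(a, b):
    """Plain Euclidean distance between two supervectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy "music in general" model: two components over 2-D features (hypothetical).
UBM_WEIGHTS = [0.5, 0.5]
UBM_MEANS = [[0.0, 0.0], [10.0, 10.0]]
UBM_VARS = [[1.0, 1.0], [1.0, 1.0]]

track_a = [[1.0, 1.0], [1.2, 0.8], [10.0, 10.0]]
track_b = [[-1.0, -1.0], [-0.8, -1.2], [10.0, 10.0]]
sv_a = supervector(track_a, UBM_WEIGHTS, UBM_MEANS, UBM_VARS)
sv_b = supervector(track_b, UBM_WEIGHTS, UBM_MEANS, UBM_VARS)
distance_ab = euclidean(sv_a, sv_b)
```

In a real system the UBM would be trained on frames from a large music collection and would have many more components. The point here is only that each track ends up as a fixed-length vector, however long the audio, so comparing two tracks costs a single distance computation.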

ISMIR 2011: music and lyrics

There were only a few papers about lyrics at ISMIR this year but some of them were really interesting.

Matt McVicar tested a rather profound hypothesis about the relationship of music and lyrics: that the common determining factor for music and words is the mood that the songwriter and composer hope to create together. Matt showed how principal axes apparently representing the valence (sad-happy) and arousal (calm-excited) scales commonly used to describe emotions fall directly out of a Canonical Correlation Analysis applied to lyrics and audio features for 120,000 songs. As Matt admitted in his presentation, the principal axes found by CCA in the audio space are very highly correlated with energy and loudness, so what's been found so far may simply show the association of happy, active lyrics with loud music and vice versa. But this is an elegant line of research that I'd enjoy seeing pursued in more depth.

Xiao Hu from the University of Denver looked just at the lyrics of 2,700 songs from the MIREX Mood Tag Dataset, which have mood annotations based on Last.fm tags. Her investigations concern the relationship of mood and creativity, defined in terms of some simple textual features mainly concerning the size and diversity of vocabulary used in a song. In a nutshell it seems that there are more ways to feel sad than there are to feel happy. But maybe you knew that already.
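Features of this kind are very cheap to compute. The sketch below shows one plausible version: token count, vocabulary size and type-token ratio. The tokenisation rule and the particular feature set are my own assumptions for illustration, not the ones used in the paper.

```python
import re


def vocabulary_features(lyrics):
    """Simple size and diversity measures for the vocabulary of one song."""
    # Lower-case words, allowing an internal apostrophe as in "don't".
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", lyrics.lower())
    vocabulary = set(tokens)
    n_tokens = len(tokens)
    return {
        "tokens": n_tokens,
        "vocabulary": len(vocabulary),
        # Share of distinct words; 0.0 for an empty lyric to avoid dividing by zero.
        "type_token_ratio": len(vocabulary) / n_tokens if n_tokens else 0.0,
    }


repetitive = vocabulary_features("la la la la")
varied = vocabulary_features("I walk the line")
```

Note that the type-token ratio falls as texts get longer, so anything comparing songs of very different lengths would need some normalisation on top of this.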

Another takehome message for me was that even simple textual features may turn out to be pretty useful for some music classification and prediction tasks. Rudolf Mayer and Andreas Rauber of TUW report some fancy methods of combining features for genre classification. They see a hefty increase in accuracy when using statistical features to summarise the style of lyrics in addition to audio features, presumably because musical genres have their own characteristic lyrical styles and poetic forms too.

Tuesday, 1 November 2011

ISMIR 2011: data, data and more data

This year's ISMIR conference saw some powerful critiques of current evaluation datasets for MIR tasks, but also some great new releases of data that should help us start to do a better job.

First the criticisms. Julian Urbano pointed out a widening gap in sophistication between the annual MIREX algorithm bake-off and more established equivalents such as TREC for document and image search. For anyone not completely persuaded by his arguments, Fabien Gouyon and his team demonstrated convincingly that current music autotagging algorithms fail to generalise from one dataset to another. In other words, at the moment there is no hard evidence that they are really learning anything at all. The main cause is probably that the available reference datasets are simply too small.

And now the data:

Thierry Bertin-Mahieux introduced some hefty additions to the Million Song Dataset, including social tags and similarity data from Last.fm for hundreds of thousands of tracks.
Vladimir Viro's Peachnote exposes a lightning fast melody and chord search UI, as well as an API and data downloads of 45,000 classical scores scanned in using OMR, including the whole IMSLP collection.
The MARL group at NYU have made chord annotations for 300 pop songs, available in their git repository here: git://github.com/tmc323/Chord-Annotations.git.
The CIRMMT group at McGill are releasing chord annotations for a thousand songs from the charts.
The METISS research group have made structural annotations for 500 songs including recent finalists from the Eurovision Song Contest.
The SALAMI project have set their musical sights a little higher by producing structural annotations for 1400 recordings across diverse genres including world and classical music.
Back to the mainstream with the Now That's What I Call Music mood annotations for 2600 songs by 4 expert listeners from TU München.
The BBC have crowd-sourced 56,000 mood annotations for well-known TV theme tunes from 15,000 listeners, and the audio is also available on request for academic researchers.
Last.fm are releasing 40,000 tempo labels and bpm estimates crowd-sourced from 2,000 listeners, showing that at least one well-known source of automatic bpm values is wrong around half of the time.
Finally, Rob Macrae and Matt McVicar have both come up with methods to make hundreds of thousands of guitar tabs available on the web usable as approximate groundtruth for chord transcription.

Wow that's a lot of new data. Time to get down to some algorithm development!

Wednesday, 26 October 2011

ISMIR 2011: crowd sourcing

One of the themes coming out of today's sessions at ISMIR is a growing interest in crowd sourcing groundtruth and evaluation data for MIR tasks. Masataka Goto introduced his amazing Songle application, which uses an impressively-crafted Flash application to display automatically extracted chord symbols, melody line, and more, for any mp3 on the web, and also allows users to correct the automatic annotations by hand. Meanwhile Dan Stowell showed a delightfully simple web interface designed for use in the school classroom, which displays automatically extracted harmony information alongside any embedded YouTube music video: truly MIR for the masses.

Finally here are the slides from my own talk about tempo estimation:

Friday, 26 August 2011

Talks...

I'll be presenting a paper on crowd-sourcing data to improve musical tempo estimation at this year's ISMIR conference in Miami in October. This reports some research around a little interactive demo which we released a while ago on Last.fm's labs website. If you haven't tried it yet then give it a go, it's quite fun... and we'll get a bit more data.

I'm also giving another talk on Hadoop at the Stack Overflow Dev Days in London in November. The nice Stack Overflow folks have also given me some $100 discount codes to hand out. Drop me a line if you'd like one; they are also valid for the upcoming Dev Days in other cities. Update: Dev Days has been cancelled, but it turns out that I'll be speaking on Hadoop at UCL on 16 November.

Friday, 15 April 2011

Algorithms on Hadoop

I gave a talk last night on some fun things we do with our Hadoop cluster at Last.fm, including:

topic modelling with LDA
graph-based recommendation with Label Propagation
audio analysis

Here are the slides.

Algorithms on Hadoop at Last.fm

View more presentations from Mark Levy
