Monday, 16 August 2010

ISMIR 2010: code

The Best Paper award went to Ron Weiss and Juan Bello of NYU for their work on finding repeating patterns in musical audio with Probabilistic Latent Component Analysis (aka Convolutive Non-Negative Matrix Factorisation) applied to chroma features. This finds a best-fit decomposition of the features, considered as a linear mixture of some number of latent components convolved with corresponding activation weights that vary over the course of the track. Weiss and Bello apply priors to enforce sparsity in the activations, so that only a small number of components are active at any given time. Better still, this means that you can start with a large number of latent components, and the activations for most of them drop to zero throughout the track as the learning algorithm proceeds, so the effective number of underlying components can be learned rather than having to be specified in advance. Finally, you can estimate the high-level repeat structure of the track by picking a simple Viterbi path through the activations. I thought the paper was well worth the prize: an original approach, convincing results, clear presentation at the conference, and, best of all, published Python code.
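To make the last step concrete, here is a minimal sketch of picking a Viterbi path through a matrix of activations. It is not the authors' published code: the function name, the `stay_prob` parameter and the simple "stay or switch" transition model are all assumptions made for illustration, and the emission score is just the normalised activation of each component in each frame.

```python
import math

def viterbi_path(activations, stay_prob=0.9):
    """Return the most likely component index for each frame.

    activations: list of frames, each a list of non-negative weights,
    one per latent component. stay_prob is an assumed probability of
    remaining in the same component from one frame to the next.
    """
    n = len(activations[0])
    eps = 1e-9  # avoids log(0) for components with zero activation
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1)) if n > 1 else float("-inf")

    def log_emit(frame, k):
        # Treat the normalised activation as an emission probability.
        return math.log(frame[k] / (sum(frame) + eps) + eps)

    score = [log_emit(activations[0], k) for k in range(n)]
    backpointers = []
    for frame in activations[1:]:
        new_score, pointers = [], []
        for k in range(n):
            # Best previous component to have come from, given we are now in k.
            best_j = max(
                range(n),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            trans = log_stay if best_j == k else log_switch
            new_score.append(score[best_j] + trans + log_emit(frame, k))
            pointers.append(best_j)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final component.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a high `stay_prob`, a single frame in which the "wrong" component briefly dominates is smoothed over, which is what you want when reading off long repeated sections rather than frame-by-frame winners.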

Ron Weiss also had a hand in the appealing Gordon Music Collection Database Management System. This looks like a great lightweight framework for managing experiments based on audio feature extraction, with a Python API, support for SQL-like queries, automatic feature caching, and a clean web interface which includes automatic best-effort visualisations of the features for each track. I'm really looking forward to trying it out. The name Gordon apparently refers to the character in Nick Hornby's novel High Fidelity.

Opinion is increasingly divided about the pros and cons of using Flash on your website. If you still love Flash, then the Audio Processing Library for Flash now lets you do audio feature extraction in real time, directly from your Flash movies. The developers even have a surprisingly funky example game, in which the graphics, and even some of the gameplay, are based directly on features extracted from the soundtrack as it plays: presumably this is one game which MIR researchers will always win!

Thursday, 12 August 2010

ISMIR 2010: recommendation

Brian McFee presented a poster on Learning Similarity from Collaborative Filters describing a method of addressing the cold start problem. The idea is to learn to rank tag representations of artists or tracks, so that the ranks agree with similarity scores given by collaborative filtering, using Metric Learning to Rank, an extension of the ranking SVM. This shows a modest improvement when applied to autotags inferred from audio, but really good results when applied to musicological tags scraped from Pandora. Conclusion: if you have musicological tags then you may not need CF data to get really good artist similarities.

Although at the moment musicological tags are probably even harder to come by than CF data, there are encouraging signs of progress in autotagging. A nice paper by James Bergstra, Mike Mandel and Doug Eck on Scalable Genre and Tag Prediction with Spectral Covariance shows impressive results with a scalable framework that learns from simple spectral features.

Along with several submissions to the MIREX evaluation tasks, this paper also illustrates another small trend at ISMIR 2010, which is a move away from known flawed features, in this case MFCCs used to represent musical timbre.

ISMIR 2010: crowdsourcing

Amazon Mechanical Turk seems to be gaining some ground as a way of collecting experimental ground truth. A workshop on Creating Speech and Language Data with Amazon Mechanical Turk recently challenged researchers to come up with useful datasets for an outlay of $100, resulting in over 20 new datasets. A review paper by Chris Callison-Burch and Mark Dredze summarises some of the lessons learned.

Now Jin Ha Lee of the University of Washington has compared crowdsourced music similarity judgements from Mechanical Turk with those provided by experts for the carefully controlled MIREX Audio Music Similarity task. It took only 12 hours to crowdsource judgements compared to two weeks to gather them from experts, and, provided that the tasks were carefully designed, the results were almost identical: only 6 out of 105 pairwise comparisons between algorithms would have been reversed by using the crowdsourced judgements.
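As an aside, 105 is the number of unordered pairs you get from 15 systems (15 × 14 / 2), assuming every pair was compared. The sketch below shows one simple way of counting reversals between two sets of judgements. It is only an illustration: the function name and the example scores are made up, and it compares raw mean scores, whereas the actual study may well have relied on significance tests rather than on raw ordering.

```python
from itertools import combinations

def reversed_pairs(scores_a, scores_b):
    """Count system pairs that two sets of scores rank in opposite order.

    scores_a, scores_b: dicts mapping system name -> mean similarity score.
    Returns (number of reversed pairs, total number of pairs).
    """
    flipped = 0
    total = 0
    for x, y in combinations(sorted(scores_a), 2):
        total += 1
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        # Opposite signs mean the two judges disagree about which is better.
        if diff_a * diff_b < 0:
            flipped += 1
    return flipped, total

# Hypothetical scores for three systems, purely for illustration.
experts = {"A": 0.70, "B": 0.60, "C": 0.50}
crowd = {"A": 0.65, "B": 0.68, "C": 0.40}
result = reversed_pairs(experts, crowd)
```

Here only the A/B comparison flips, so `result` counts one reversal out of three pairs.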

In another presentation Mike Mandel showed how social tags for short audio clips sourced from Mechanical Turk could be used for autotagging. He uses a conditional restricted Boltzmann machine to model the co-occurrence of tags between different clips, with inputs representing the user who applied each tag, as well as the track, album and artist from which each clip was drawn. The learned weights are used to create a smoothed tag cloud from the tags provided for a clip by several users. When used as input to his autotagging classifiers, these smoothed tag clouds gave better results than simply aggregating tags across users, approaching the performance of classifiers trained on tags from the much more controlled MajorMiner game.

Wednesday, 11 August 2010

ISMIR 2010: applications

A team from Waseda University and KDDI labs are demoing a system that generates slideshows of Flickr photos from song lyrics. The images are chosen automatically for each line of the lyrics, relying in part on assigning a "general impression" for each line chosen from a small dictionary of concepts such as season of the year, weather and time of day. I'm not sure how generally applicable this system is, but the example slideshow I saw was surprisingly poetic, like the serendipitously beautiful sequences you sometimes see on the visual radio player (which only uses artist images).

Scott Miller presented an action-packed poster on GeoShuffle, an iPhone app which, amongst other things, recommends music based on geolocation information. In a nice experiment GeoShuffle was given to a number of users for three weeks. During that time it created playlists for them, randomly switching between several different playlist generation methods every hour or so. Meanwhile their skip rate was recorded, to see which playlist generation algorithm they liked the best. A method that chose similar songs to those listened to while travelling on the same path was by far the best. Conclusion: humans are creatures of habit!
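The evaluation itself is easy to picture in code. This is a minimal sketch of computing a skip rate per playlist method from a log of plays; the event format and the method names are invented for illustration and are not taken from the poster.

```python
from collections import defaultdict

def skip_rates(events):
    """Compute the fraction of skipped tracks for each playlist method.

    events: iterable of (method_name, was_skipped) pairs, one per track played.
    """
    played = defaultdict(int)
    skipped = defaultdict(int)
    for method, was_skipped in events:
        played[method] += 1
        if was_skipped:
            skipped[method] += 1
    return {method: skipped[method] / played[method] for method in played}

# Hypothetical listening log: two made-up methods, a handful of plays each.
log = [
    ("same_path", False), ("same_path", False), ("same_path", True),
    ("random", True), ("random", True), ("random", False), ("random", True),
]
rates = skip_rates(log)
best_method = min(rates, key=rates.get)  # lowest skip rate wins
```

Because the app switches method at random every hour or so, each user acts as their own control, so comparing skip rates across methods is reasonably fair even with only a few users.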

ISMIR 2010: music transcription

The Music Information Retrieval community is a pretty broad church, but the very much unsolved problem of transcribing a full score from audio input remains at its heart. Nowadays this is approached via various somewhat simpler subtasks, such as capturing just the harmony of the music as a series of chord labels. Most existing approaches to this use the chroma feature as the main input to their labelling algorithm. Chroma purports to show the amount of energy associated with each pitch class (note of the scale) in a given short time period, but it's well known to be highly flawed: if you resynthesize the pitches implied by the chroma it usually sounds nothing like the original audio.
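A toy version of chroma makes the flaw easy to see. The sketch below folds a list of spectral peaks into 12 pitch-class bins. It assumes equal temperament with A4 = 440 Hz and takes ready-made (frequency, energy) pairs as input; real chroma extractors work from a full spectrogram and differ in many details, and the function names here are my own.

```python
import math

def pitch_class(freq_hz, ref_a4=440.0):
    """Return the pitch class (0 = C, ..., 9 = A, 11 = B) nearest to freq_hz."""
    midi_note = round(69 + 12 * math.log2(freq_hz / ref_a4))
    return midi_note % 12

def chroma(peaks):
    """Fold (frequency_hz, energy) pairs into a 12-bin chroma vector."""
    bins = [0.0] * 12
    for freq_hz, energy in peaks:
        bins[pitch_class(freq_hz)] += energy
    return bins

# A3 (220 Hz) and A4 (440 Hz) land in the same bin: the octave is thrown away.
example = chroma([(220.0, 1.0), (440.0, 2.0), (261.63, 0.5)])
```

Since every octave collapses into the same bin, and overtones of one note add energy to the bins of other pitch classes, there is no way to recover which notes were actually played, which goes some way to explaining why a resynthesis from chroma sounds so unlike the original.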

A convincing presentation by Matthias Mauch, currently working in Masataka Goto's AIST lab in Japan, showed how you can improve on existing chord labelling performance by using the output of a multipitch transcription algorithm instead of chroma. Mauch's labeller is fairly sophisticated (a Dynamic Bayes Net with a number of inputs as well as pitch evidence), but his pitch transcription algorithm is a very simple best-fit effort: expect a flurry of papers seeing if fancy state-of-the-art methods can work even better than the current 80% accuracy on the MIREX test set of songs by The Beatles and others.

Two posters dealt with the misleadingly named "octave error" problem in estimating the tempo (bpm) of music from audio input: state-of-the-art beat trackers are good at finding regular pulses, but they often have trouble telling the beat from the half-beat. I really liked Jason Hockman's approach to solving this. Instead of making any changes to his beat tracker, he simply trains a classifier to predict whether a given track is slow or fast, using training examples which users have tagged as slow or fast. Despite using just a single aggregate feature vector for each track, which doesn't obviously contain any tempo information, his slow-fast classifier works really well, with an accuracy of over 95%. Presumably slow songs simply sound more like other slow songs than fast ones. I'll be interested to see how much this can help in tempo estimation. I'd expect that the classifier wouldn't work so well with songs that aren't obviously slow or fast enough to have been tagged as either, but I'd also guess that moderate tempo songs are less affected by octave error in the first place.
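One obvious way a slow/fast label could feed into tempo estimation is sketched below. This is my own guess at how the two might be combined, not something described in the poster: the function name and the 100 bpm boundary between "slow" and "fast" are both assumptions.

```python
def resolve_octave(bpm, label, boundary=100.0):
    """Halve or double a beat tracker's estimate so it agrees with a slow/fast label.

    bpm: tempo estimate from the beat tracker.
    label: "slow" or "fast", e.g. from a classifier; any other value
    leaves the estimate unchanged.
    boundary: assumed bpm threshold separating slow from fast.
    """
    if label == "slow" and bpm >= boundary:
        return bpm / 2
    if label == "fast" and bpm < boundary:
        return bpm * 2
    return bpm
```

For example, an estimate of 140 bpm for a track classified as slow becomes 70 bpm, while 120 bpm for a fast track is left alone. A track the classifier can't confidently place would simply keep the tracker's original estimate.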