Thursday, 11 July 2013

New Blog: Data Science in Action

A couple of months ago I left to join the data science team at Mendeley, meaning that for pretty much the first time in my life I'm not officially working on anything to do with music.  So it seemed time to get a new name for my blog.  Although I'm not personally convinced that "data scientist" actually means anything in particular, it's in my new job title, and of course the Harvard Business Review recently called it "the sexiest job of the 21st century", so I've gone with the flow.

If you've found anything of interest here then please follow me over to

Thursday, 18 April 2013

Contributing to GraphChi

GraphChi is a framework for processing large graphs efficiently on a single machine, developed by Aapo Kyrölä of CMU as a spin off from the impressive GraphLab distributed graph processing project.  Both GraphLab and GraphChi come with a really handy Collaborative Filtering Toolbox, implementing numerous recent algorithms and developed at CMU by Danny Bickson.

GraphChi looks like a great project so I decided to try to contribute to it and took the chance to implement an algorithm that I'd been wanting to investigate with for a while: Collaborative Less-is-More Filtering, developed by Yue Shi at TU Delft, which won a best paper award at RecSys last year.  CLiMF optimises for the Mean Reciprocal Rank of correctly predicted items i.e. it's designed to promote accuracy and diversity in recommendations at the same time.  Although it's really intended for binary preference data like follow or friend relations, it's easy to implement a threshold on ratings that automatically binarises them during learning, so CLiMF can also be used with ratings datasets.

Danny made contributing to the toolbox really easy and CLiMF is now available in GraphChi, and documented alongside the other algorithms.  I also wrote a simple Python implementation which works fine for small datasets and which was useful for reference.

You can get the latest version of GraphChi and the collaborative filtering toolbox from here.

Thursday, 21 March 2013

Hadoop and beyond: power tools for data mining

Last week Dell Zhang kindly invited me to give a guest lecture to Birkbeck and UCL students on his Cloud Computing course.  It was fun to show them some of the tools that make Hadoop development easier and more maintainable, and also some of the problems for which Hadoop is not a magic bullet.

I took the chance to evangelise about some of my favourite tools and frameworks, and to spend a bit of time getting to know some new ones.  I specially enjoyed taking a look at Scalding, Spark and GraphChi.

Here are the slides:


Thursday, 22 November 2012

Making recommendations - efficiently

Anyone who's even vaguely followed the lively development of recommender systems since the Netflix Prize offered researchers $1M to find a way of improving movie recommendations will have noticed that most of the models proposed follow the same outline:

  1. the input data is a large matrix with a row for each user and a column for each item
  2. entries in the matrix represent ratings or preference scores
  3. the model represents both users and items as vectors in a common low-dimensional space
  4. after training unknown entries in the matrix can be predicted by taking the inner product of the corresponding user and item vectors

If you want to make recommendations for a user, step 4 lets you predict their rating for all possible items.  You then simply return the few items with the highest predicted ratings.

Did you notice that one small but significant detail here has been "left as an exercise for the reader"?  If you have any significant number of items in your catalogue then predicting and sorting ratings for all of them is going to take rather a long time - far too long for most practical applications.

An elegant paper by Parikshit Ram and Alexander Gray presented at KDD this year offers a simple solution to this problem.  Maximum Inner-Product Search using Tree Data-Structure shows how to construct and use a ball tree designed specifically to find vectors maximising the inner product with one or more queries.  Ram and Gray start from the observation that this problem is similar but different to that of finding nearest neighbours, the difference being that the inner product is not a genuine distance metric as it doesn't satisfy the triangle inequality.

I've written a very simple Python implementation of their proposed MIP Tree which you can find on github.  It's not intended as production code but still shows speedups of several orders of magnitude when I've tried it on real latent factor datasets.  Feel free to have a play with it to generate some recommendations - quite efficiently!

Thursday, 13 September 2012

Collaborative filtering still rules?

I just found time to catch up with the result of the Million Song Dataset Challenge, the aim of which was to build the best music recommendation algorithm. What was original about the challenge was its rather large and realistic dataset, which included detailed listening histories for over a million listeners. The winner was Fabio Aiolli of Padova University. Despite being a real machine learning guy, Fabio's winning submission was about as simple as you can get: memory-based Collaborative Filtering, which is pretty much the grand daddy of all recommendation techniques.

I imagine that this result came as a rather nasty surprise for the competition organisers, who went to some trouble to make additional sources of data available, with the specific aim of enabling more complex approaches such as using audio analysis or taking advantage of artist metadata and so on. The same may go for those who raised loud objections to last year's KDD recommendation challenge, which used an anonymized dataset from Yahoo! intended to make it impossible even to identify the artists of the tracks involved, let alone get hold of the audio.

There is a reason though, why we probably shouldn't be so surprised by this result. This was a recommendation challenge with a twist: it didn't actually involve recommendation (something I mentioned as an industry advisor to the competition, though even I didn't quite foresee the consequences for the results). To evaluate a track shown to a user as a recommendation you'd want to know roughly the following:
  • they hadn't listened to it before
  • they liked it
  • they weren't about to listen to it anyway
The design of the competition meant that all of these questions were completely disregarded, mainly because no timestamps were provided in the listening data released to the organisers.

Instead the challenge was simply to predict which tracks a user had already listened to, given a number of other tracks which they had already listened to. In other words the aim was to model how "users who listen to track A also listen to track B", which is exactly what classical Collaborative Filtering is known to do rather well.

This sort of model can be a central component of a recommendation system, even if on its own it doesn't provide recommendations, so it's still interesting to see that classic CF appears to do this better than anything else, and to see a few tricks that Fabio used to fit the data as tightly as possible:
  • He used an asymmetric form of cosine distance when comparing items \begin{equation} w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^\alpha |U(j)|^{(1 - \alpha)}} \end{equation} where \(U(i)\) is the set of users who have listened to item \(i\). Setting \(\alpha = 0\) works pretty well.
  • He applied an exponential function when aggregating these distances for a candidate item \begin{equation} h_{ui} = \sum_{j \in X(u)}{w_{ij}^q} \end{equation} where \(X(u)\) is the set of tracks listened to by user \(u\) and \(h_{ui}\) is the score for a potential recommendation \(i\). Setting \(q = 3\) works well. This was the single most important tweak.
  • He blended these item-based recommendations with similar user-based ones. This made a small improvement, but the item-based recommender did most of the work.

Wednesday, 2 November 2011

ISMIR 2011: lessons from speech and image processing

Cool ideas from speech and image processing tend to make their way gradually into music-related research, and a few of them were on display at ISMIR.

The Best Student Paper prize went to Mikael Henaff from NYU's Courant Institute for his research applying Predictive Sparse Decomposition to audio spectrograms, to learn sparse features for classification. This uses some nice techniques that were new to me, firstly to sharpen the spectrogram, and then to learn basis functions efficiently by gradient descent. Not only do the resulting features give impressive classification accuracy even with a relatively simple classifier, but - as you can see from the figure on the right - many of them appear to learn musical intervals and chords directly from the spectrogram.

The music similarity system submitted to MIREX by Geoffroy Peeters' team from IRCAM models each track as a Gaussian Mixture Model supervector. This representation is explained in a paper from the DAFX conference from a few weeks back, and borrows from some recent methods in voice recognition. You start by training a Universal Background Model, which represents the sound of "music in general". Any particular track can then be represented as a vector of differences from the UBM, which are learned by a simple iterative process. These resulting "supervectors" can then be efficiently compared to one another with Euclidean distance.

ISMIR 2011: music and lyrics

There were only a few papers about lyrics at ISMIR this year but some of them were really interesting.

Matt McVicar tested a rather profound hypothesis about the relationship of music and lyrics: that the common determining factor for music and words is the mood that the songwriter and composer hope to create together. Matt showed how principal axes apparently representing the valence (sad-happy) and arousal (calm-excited) scales commonly used to describe emotions fall directly out of a Canonical Correlation Analysis applied to lyrics and audio features for 120,000 songs. As Matt admitted in his presentation, the principal axes found by CCA in the audio space are very highly correlated with energy and loudness, so what's been found so far may simply show the association of happy, active lyrics with loud music and vice versa. But this is an elegant line of research that I'd enjoy seeing pursued in more depth.

Xiao Hu from the University of Denver looked just at the lyrics of 2,700 songs from the MIREX Mood Tag Dataset, which have mood annotations based on tags. Her investigations concern the relationship of mood and creativity, defined in terms of some simple textual features mainly concerning the size and diversity of vocabulary used in a song. In a nutshell it seems that there are more ways to feel sad than there are to feel happy. But maybe you knew that already.

Another takehome message for me was that even simple textual features may turn out to be pretty useful for some music classification and prediction tasks. Rudolf Mayer and Andreas Rauber of TUW report some fancy methods of combining features for genre classification. They see a hefty increase in accuracy when using statistical features to summarise the style of lyrics in addition to audio features, presumably because musical genres have their own characteristic lyrical styles and poetic forms too.