Thursday, 13 September 2012

Collaborative filtering still rules?

I just found time to catch up with the results of the Million Song Dataset Challenge, the aim of which was to build the best music recommendation algorithm. What was original about the challenge was its rather large and realistic dataset, which included detailed listening histories for over a million listeners. The winner was Fabio Aiolli of Padova University. Although Fabio is a real machine learning guy, his winning submission was about as simple as you can get: memory-based Collaborative Filtering, which is pretty much the granddaddy of all recommendation techniques.

I imagine that this result came as a rather nasty surprise for the competition organisers, who went to some trouble to make additional sources of data available, with the specific aim of enabling more complex approaches such as using audio analysis or taking advantage of artist metadata and so on. The same may go for those who raised loud objections to last year's KDD recommendation challenge, which used an anonymized dataset from Yahoo! intended to make it impossible even to identify the artists of the tracks involved, let alone get hold of the audio.

There is a reason, though, why we probably shouldn't be so surprised by this result. This was a recommendation challenge with a twist: it didn't actually involve recommendation (something I mentioned as an industry advisor to the competition, though even I didn't quite foresee the consequences for the results). To evaluate a track shown to a user as a recommendation you'd want to know roughly the following:
  • they hadn't listened to it before
  • they liked it
  • they weren't about to listen to it anyway
The design of the competition meant that all of these questions were completely disregarded, mainly because no timestamps were provided in the listening data released to the organisers.

Instead the challenge was simply to predict which tracks a user had already listened to, given a number of other tracks which they had already listened to. In other words the aim was to model how "users who listen to track A also listen to track B", which is exactly what classical Collaborative Filtering is known to do rather well.

This sort of model can be a central component of a recommendation system, even if on its own it doesn't provide recommendations, so it's still interesting to see that classic CF appears to do this better than anything else, and to see a few tricks that Fabio used to fit the data as tightly as possible:
  • He used an asymmetric form of cosine similarity when comparing items \begin{equation} w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^\alpha |U(j)|^{(1 - \alpha)}} \end{equation} where \(U(i)\) is the set of users who have listened to item \(i\). Setting \(\alpha = 0\) works pretty well.
  • He applied an exponential function when aggregating these similarities for a candidate item \begin{equation} h_{ui} = \sum_{j \in X(u)}{w_{ij}^q} \end{equation} where \(X(u)\) is the set of tracks listened to by user \(u\) and \(h_{ui}\) is the score for a potential recommendation \(i\). Setting \(q = 3\) works well. This was the single most important tweak.
  • He blended these item-based recommendations with similar user-based ones. This made a small improvement, but the item-based recommender did most of the work.
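Both tweaks are simple enough to sketch in a few lines of code. Here's a minimal Python illustration on an invented toy dataset (the data, and names like `histories` and `score`, are my own for illustration, not Fabio's actual code), computing the asymmetric similarity \(w_{ij}\) and the aggregated score \(h_{ui}\) from the formulas above:

```python
from collections import defaultdict

# Hypothetical toy listening histories: user -> set of track ids.
histories = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C", "D"},
    "u4": {"A", "D"},
}

# Invert to item -> set of listeners, i.e. U(i) in the formulas above.
listeners = defaultdict(set)
for user, tracks in histories.items():
    for track in tracks:
        listeners[track].add(user)

def asym_cosine(i, j, alpha=0.0):
    """Asymmetric cosine similarity w_ij between items i and j."""
    ui, uj = listeners[i], listeners[j]
    if not ui or not uj:
        return 0.0
    return len(ui & uj) / (len(ui) ** alpha * len(uj) ** (1 - alpha))

def score(user, i, q=3.0, alpha=0.0):
    """Score h_ui: sum of w_ij^q over the tracks j the user has played."""
    return sum(asym_cosine(i, j, alpha) ** q
               for j in histories[user] if j != i)

# Rank the tracks u2 hasn't played yet by their scores.
candidates = [t for t in listeners if t not in histories["u2"]]
ranking = sorted(candidates, key=lambda t: score("u2", t), reverse=True)
```

On this toy data, track C outranks track D for user u2 because C co-occurs more often with the tracks u2 has played, and raising the similarities to the power \(q = 3\) sharpens that gap further.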


  1. Hi Mark,
    Great summary of the challenge and the winning solution.

    Yes, organizers might be a little disappointed that pure CF was a clear winner, and yes, we were warned that this is not a real recommendation setting.

    A few remarks though, in no particular order:
    - it still is nearly impossible to do real recommendations without an A/B setup or an extensive large-scale survey;
    - content would be more useful in a real recommendation setting, but there should still be correlations between what someone listens to and audio attributes, or tags, or... it might just be more hidden;
    - that no winner used content is not a complete proof that content is useless, we might just not have found the right way to use it;
    - timestamps would indeed have been great, but do you have some number or intuition that tells us how important they are? i.e. how much more correlated is a song with the previously listened one, as opposed to all other songs listened to;
    - there might have been... "contestant bias"? in the sense that no big MIR team seriously worked on the contest, it was mostly ML folks;
    - we got a solution that is probably within epsilon of the CF state-of-the-art; that's great knowledge for people who now want to publish on content-based recommendation and want to compare their results with CF, and eventually content might improve that result by a few percent;

    Slightly related, I'd like to point to Amelie Anglade's question on the late-breaking demo page: what's up with music recommendation at MIR? Does the community still care, or should people interested in this go to RecSys? I'm not saying that not participating in the MSD Challenge means one does not care about music recommendation. I'm saying there might be a lack of alternative options.

  2. Hi Thierry,

    Nothing beats live A/B testing, but I think you could use listening histories with timestamps to get a decent groundtruth for useful recommendations.

    Suppose you have a training set that exposes my listening history up to time T. If you were able to recommend me a track that had been released before T, which I didn't discover until time T + 6 months (say) but then played a lot, that would have been a good recommendation! And you can find tracks like that from listening histories plus release dates - as long as you have timestamps.
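    To make that concrete, here's a hypothetical Python sketch of mining such ground truth from timestamped histories plus release dates. Everything here is invented for illustration (the data, the day-offset timestamps, the `gap` and `min_plays` thresholds, and the function name):

    ```python
    from collections import defaultdict

    # Hypothetical timestamped plays: (user, track, day of play), with
    # release dates, all as integer day offsets for simplicity.
    plays = [
        ("mark", "t1", 10), ("mark", "t1", 12),
        ("mark", "t2", 400), ("mark", "t2", 401), ("mark", "t2", 405),
        ("mark", "t3", 50),
    ]
    released = {"t1": 5, "t2": 20, "t3": 40}

    def good_recommendations(user, T, gap=180, min_plays=3):
        """Tracks released before T, first played by this user at least
        `gap` days after T, then played at least `min_plays` times:
        plausible ground truth for "would have been a good recommendation"."""
        by_track = defaultdict(list)
        for u, t, day in plays:
            if u == user:
                by_track[t].append(day)
        good = []
        for t, days in by_track.items():
            first_play = min(days)
            if (released.get(t, T) < T
                    and first_play >= T + gap
                    and len(days) >= min_plays):
                good.append(t)
        return good
    ```

    With the toy data above, `good_recommendations("mark", 200)` picks out only "t2": released well before T, discovered roughly six months later, then played repeatedly.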

    My point here isn't that simple CF is the be-all and end-all for music recommendation, only that this competition set out to measure exactly the thing that CF is good at. I know from daily experience that there's still a lot of room for improvement in practical music recommendation systems. I think other people still care - I certainly hope so!

  3. Gee Mark, are you volunteering to get us listening histories with timestamps? :-)

    Just kidding. Sort of.

    Seriously though, I'm not at all disappointed in the outcome. As Thierry says, we now have a solid set of benchmarks on this data set. This does leave open the challenge of how metadata and content can be used to fill in the gaps, and I sincerely hope that some enterprising MIR team goes for it in round two.