I imagine that this result came as a rather nasty surprise for the competition organisers, who went to some trouble to make additional sources of data available, with the specific aim of enabling more complex approaches such as using audio analysis or taking advantage of artist metadata and so on. The same may go for those who raised loud objections to last year's KDD recommendation challenge, which used an anonymized dataset from Yahoo! intended to make it impossible even to identify the artists of the tracks involved, let alone get hold of the audio.
There is a reason though, why we probably shouldn't be so surprised by this result. This was a recommendation challenge with a twist: it didn't actually involve recommendation (something I mentioned as an industry advisor to the competition, though even I didn't quite foresee the consequences for the results). To evaluate a track shown to a user as a recommendation you'd want to know roughly the following:
- they hadn't listened to it before
- they liked it
- they weren't about to listen to it anyway
Instead the challenge was simply to predict which tracks a user had already listened to, given a number of other tracks which they had already listened to. In other words the aim was to model how "users who listen to track A also listen to track B", which is exactly what classical Collaborative Filtering is known to do rather well.
This sort of model can be a central component of a recommendation system, even if on its own it doesn't provide recommendations, so it's still interesting to see that classic CF appears to do this better than anything else, and to see a few tricks that Fabio used to fit the data as tightly as possible:
- He used an asymmetric form of cosine distance when comparing items \begin{equation} w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^\alpha |U(j)|^{(1 - \alpha)}} \end{equation} where \(U(i)\) is the set of users who have listened to item \(i\). Setting \(\alpha = 0\) works pretty well.
- He applied an exponential function when aggregating these distances for a candidate item \begin{equation} h_{ui} = \sum_{j \in X(u)}{w_{ij}^q} \end{equation} where \(X(u)\) is the set of tracks listened to by user \(u\) and \(h_{ui}\) is the score for a potential recommendation \(i\). Setting \(q = 3\) works well. This was the single most important tweak.
- He blended these item-based recommendations with similar user-based ones. This made a small improvement, but the item-based recommender did most of the work.