Thursday, 22 November 2012

Making recommendations - efficiently

Anyone who's even vaguely followed the lively development of recommender systems since the Netflix Prize offered researchers $1M to find a way of improving movie recommendations will have noticed that most of the models proposed follow the same outline:

  1. the input data is a large matrix with a row for each user and a column for each item
  2. entries in the matrix represent ratings or preference scores
  3. the model represents both users and items as vectors in a common low-dimensional space
  4. after training unknown entries in the matrix can be predicted by taking the inner product of the corresponding user and item vectors

If you want to make recommendations for a user, step 4 lets you predict their rating for all possible items.  You then simply return the few items with the highest predicted ratings.

Did you notice that one small but significant detail here has been "left as an exercise for the reader"?  If you have any significant number of items in your catalogue then predicting and sorting ratings for all of them is going to take rather a long time - far too long for most practical applications.

An elegant paper by Parikshit Ram and Alexander Gray presented at KDD this year offers a simple solution to this problem.  Maximum Inner-Product Search using Tree Data-Structure shows how to construct and use a ball tree designed specifically to find vectors maximising the inner product with one or more queries.  Ram and Gray start from the observation that this problem is similar but different to that of finding nearest neighbours, the difference being that the inner product is not a genuine distance metric as it doesn't satisfy the triangle inequality.

I've written a very simple Python implementation of their proposed MIP Tree which you can find on github.  It's not intended as production code but still shows speedups of several orders of magnitude when I've tried it on real latent factor datasets.  Feel free to have a play with it to generate some recommendations - quite efficiently!

Thursday, 13 September 2012

Collaborative filtering still rules?

I just found time to catch up with the result of the Million Song Dataset Challenge, the aim of which was to build the best music recommendation algorithm. What was original about the challenge was its rather large and realistic dataset, which included detailed listening histories for over a million listeners. The winner was Fabio Aiolli of Padova University. Despite being a real machine learning guy, Fabio's winning submission was about as simple as you can get: memory-based Collaborative Filtering, which is pretty much the grand daddy of all recommendation techniques.

I imagine that this result came as a rather nasty surprise for the competition organisers, who went to some trouble to make additional sources of data available, with the specific aim of enabling more complex approaches such as using audio analysis or taking advantage of artist metadata and so on. The same may go for those who raised loud objections to last year's KDD recommendation challenge, which used an anonymized dataset from Yahoo! intended to make it impossible even to identify the artists of the tracks involved, let alone get hold of the audio.

There is a reason though, why we probably shouldn't be so surprised by this result. This was a recommendation challenge with a twist: it didn't actually involve recommendation (something I mentioned as an industry advisor to the competition, though even I didn't quite foresee the consequences for the results). To evaluate a track shown to a user as a recommendation you'd want to know roughly the following:
  • they hadn't listened to it before
  • they liked it
  • they weren't about to listen to it anyway
The design of the competition meant that all of these questions were completely disregarded, mainly because no timestamps were provided in the listening data released to the organisers.

Instead the challenge was simply to predict which tracks a user had already listened to, given a number of other tracks which they had already listened to. In other words the aim was to model how "users who listen to track A also listen to track B", which is exactly what classical Collaborative Filtering is known to do rather well.

This sort of model can be a central component of a recommendation system, even if on its own it doesn't provide recommendations, so it's still interesting to see that classic CF appears to do this better than anything else, and to see a few tricks that Fabio used to fit the data as tightly as possible:
  • He used an asymmetric form of cosine distance when comparing items \begin{equation} w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^\alpha |U(j)|^{(1 - \alpha)}} \end{equation} where \(U(i)\) is the set of users who have listened to item \(i\). Setting \(\alpha = 0\) works pretty well.
  • He applied an exponential function when aggregating these distances for a candidate item \begin{equation} h_{ui} = \sum_{j \in X(u)}{w_{ij}^q} \end{equation} where \(X(u)\) is the set of tracks listened to by user \(u\) and \(h_{ui}\) is the score for a potential recommendation \(i\). Setting \(q = 3\) works well. This was the single most important tweak.
  • He blended these item-based recommendations with similar user-based ones. This made a small improvement, but the item-based recommender did most of the work.