
Thursday, 22 November 2012

Making recommendations - efficiently

Anyone who's even vaguely followed the lively development of recommender systems since the Netflix Prize offered researchers $1M to find a way of improving movie recommendations will have noticed that most of the models proposed follow the same outline:

  1. the input data is a large matrix with a row for each user and a column for each item
  2. entries in the matrix represent ratings or preference scores
  3. the model represents both users and items as vectors in a common low-dimensional space
  4. after training, unknown entries in the matrix can be predicted by taking the inner product of the corresponding user and item vectors

If you want to make recommendations for a user, step 4 lets you predict their rating for all possible items.  You then simply return the few items with the highest predicted ratings.
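
Concretely, the naive version of that last step is one inner product per item followed by a sort.  A minimal sketch, assuming training has already produced a matrix of item factors and a factor vector for our user:

import numpy as np

def recommend_naive(user_vec, item_factors, k=10):
    # score every item against the user, then keep the k highest
    scores = item_factors.dot(user_vec)
    top = np.argsort(scores)[-k:][::-1]
    return [(i, scores[i]) for i in top]
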

Did you notice that one small but significant detail here has been "left as an exercise for the reader"?  If you have any significant number of items in your catalogue then predicting and sorting ratings for all of them is going to take rather a long time - far too long for most practical applications.

An elegant paper by Parikshit Ram and Alexander Gray presented at KDD this year offers a simple solution to this problem.  Maximum Inner-Product Search using Tree Data-Structure shows how to construct and use a ball tree designed specifically to find the vectors maximising the inner product with one or more query vectors.  Ram and Gray start from the observation that this problem is similar to, but different from, nearest neighbour search: the inner product is not a genuine distance metric, as it doesn't satisfy the triangle inequality.
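
The pruning rule at the heart of their tree is easy to state: for a ball with centre c and radius r, no point inside the ball can have an inner product with a query q larger than q.c + r||q||, by Cauchy-Schwarz.  Here's a minimal branch-and-bound sketch built on that bound; the Node and Best containers are hypothetical, and this is my own paraphrase rather than code from the paper:

import numpy as np
from collections import namedtuple

Best = namedtuple("Best", ["score", "point"])

def mip_bound(query, centre, radius):
    # no point in the ball can beat q.c + r * ||q||
    return query.dot(centre) + radius * np.linalg.norm(query)

def search(node, query, best=Best(-np.inf, None)):
    # prune the whole subtree if its bound can't beat the best so far
    if mip_bound(query, node.centre, node.radius) <= best.score:
        return best
    if node.is_leaf:
        for p in node.points:
            s = query.dot(p)
            if s > best.score:
                best = Best(s, p)
        return best
    # visit the more promising child first, to tighten the bound early
    for child in sorted(node.children,
                        key=lambda c: -mip_bound(query, c.centre, c.radius)):
        best = search(child, query, best)
    return best
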

I've written a very simple Python implementation of their proposed MIP Tree which you can find on github.  It's not intended as production code but still shows speedups of several orders of magnitude when I've tried it on real latent factor datasets.  Feel free to have a play with it to generate some recommendations - quite efficiently!

Thursday, 13 September 2012

Collaborative filtering still rules?

I just found time to catch up with the result of the Million Song Dataset Challenge, the aim of which was to build the best music recommendation algorithm. What was original about the challenge was its rather large and realistic dataset, which included detailed listening histories for over a million listeners. The winner was Fabio Aiolli of Padova University. Although Fabio is a real machine learning guy, his winning submission was about as simple as you can get: memory-based Collaborative Filtering, which is pretty much the granddaddy of all recommendation techniques.

I imagine that this result came as a rather nasty surprise for the competition organisers, who went to some trouble to make additional sources of data available, with the specific aim of enabling more complex approaches such as using audio analysis or taking advantage of artist metadata and so on. The same may go for those who raised loud objections to last year's KDD recommendation challenge, which used an anonymized dataset from Yahoo! intended to make it impossible even to identify the artists of the tracks involved, let alone get hold of the audio.

There is a reason though, why we probably shouldn't be so surprised by this result. This was a recommendation challenge with a twist: it didn't actually involve recommendation (something I mentioned as an industry advisor to the competition, though even I didn't quite foresee the consequences for the results). To evaluate a track shown to a user as a recommendation you'd want to know roughly the following:
  • they hadn't listened to it before
  • they liked it
  • they weren't about to listen to it anyway
The design of the competition meant that all of these questions were completely disregarded, mainly because no timestamps were provided in the listening data released to the organisers.

Instead the challenge was simply to predict which tracks a user had already listened to, given a number of other tracks which they had already listened to. In other words the aim was to model how "users who listen to track A also listen to track B", which is exactly what classical Collaborative Filtering is known to do rather well.

This sort of model can be a central component of a recommendation system, even if on its own it doesn't provide recommendations, so it's still interesting to see that classic CF appears to do this better than anything else, and to see a few tricks that Fabio used to fit the data as tightly as possible:
  • He used an asymmetric form of cosine similarity when comparing items \begin{equation} w_{ij} = \frac{|U(i) \cap U(j)|}{|U(i)|^\alpha |U(j)|^{(1 - \alpha)}} \end{equation} where \(U(i)\) is the set of users who have listened to item \(i\). Setting \(\alpha = 0\) works pretty well. (Both this and the next trick are sketched in code just after this list.)
  • He applied an exponential function when aggregating these distances for a candidate item \begin{equation} h_{ui} = \sum_{j \in X(u)}{w_{ij}^q} \end{equation} where \(X(u)\) is the set of tracks listened to by user \(u\) and \(h_{ui}\) is the score for a potential recommendation \(i\). Setting \(q = 3\) works well. This was the single most important tweak.
  • He blended these item-based recommendations with similar user-based ones. This made a small improvement, but the item-based recommender did most of the work.
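
Here's a minimal sketch of those two tricks, brute-forced over a toy dataset: listens maps each user to the set of tracks they've played, and the real winning entry is of course far more carefully engineered than this.

from collections import defaultdict

ALPHA = 0.0   # asymmetry parameter from the first formula
Q = 3.0       # exponent from the second formula

def item_weights(listens):
    # w_ij = |U(i) & U(j)| / (|U(i)|^alpha * |U(j)|^(1-alpha))
    users = defaultdict(set)              # U(i): the users who've played item i
    for u, tracks in listens.items():
        for i in tracks:
            users[i].add(u)
    w = {}
    for i in users:
        for j in users:
            if i == j:
                continue
            overlap = len(users[i] & users[j])
            if overlap:
                w[i, j] = overlap / (len(users[i]) ** ALPHA *
                                     len(users[j]) ** (1 - ALPHA))
    return w

def score(u, i, listens, w):
    # h_ui = sum over j in X(u) of w_ij ** q
    return sum(w.get((i, j), 0.0) ** Q for j in listens[u])
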

Tuesday, 5 October 2010

RecSys 2010: YouTube

The YouTube team had a poster at RecSys describing their recommender in some detail.  The design is intentionally simple, and apparently entirely implemented as a series of MapReduce jobs.

They first compute fixed-length lists of related videos, based in principle on simple co-occurrence counts within a short time window: given a video i, the count for a candidate related video j could be the number of users who viewed both i and j within the last 24 hours.  The counts are normalised to take into account the relative popularity of different videos, and no doubt massaged in numerous other ways to remove noise and bias.  As the paper says, "this is a simplified description".
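
In toy form the relatedness computation might look something like this, with user sessions standing in for the 24-hour windows and the product of view counts standing in for whatever normalisation they actually use:

from collections import defaultdict
from itertools import combinations

def related_videos(sessions, top_n=20):
    # sessions is a list of sets of video ids viewed together
    views = defaultdict(int)
    covisits = defaultdict(int)
    for session in sessions:
        for v in session:
            views[v] += 1
        for i, j in combinations(sorted(session), 2):
            covisits[i, j] += 1
    related = defaultdict(list)
    for (i, j), c in covisits.items():
        score = float(c) / (views[i] * views[j])   # popularity normalisation
        related[i].append((j, score))
        related[j].append((i, score))
    # keep a fixed-length list of the most related videos for each video
    return dict((v, sorted(rel, key=lambda x: -x[1])[:top_n])
                for v, rel in related.items())
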

At recommendation time they build a set of seed videos representing the query user, based on the user's views, favourites, playlists, etc.  They then assemble a candidate pool containing the related videos for all of the seeds.  If the pool is too small, they expand it by adding the related videos of all the videos already in the pool, though always keeping track of the original seed video for messaging purposes.  The candidates in the pool are reranked based on a linear combination of values expressing the popularity of a given candidate video, the importance of its seed to the user, the overall popularity of the candidate and its freshness.  Finally the recommended videos are diversified, using simple constraints on the number of recommended videos that can be associated with any one seed, or have been uploaded by any one user.  Diversification is particularly important as related videos are typically very tightly associated with their seed.
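
A toy version of that recommendation step, with made-up weights, simplified signals and no diversification, might look like:

def recommend(seeds, related, seed_importance, popularity, freshness):
    # candidate pool: the related videos of every seed, remembering which
    # seed produced each candidate
    pool = {}
    for s in seeds:
        for v, _ in related.get(s, []):
            pool.setdefault(v, s)
    # expand the pool if it's too small, still tracking the original seed
    if len(pool) < 50:
        for v, s in list(pool.items()):
            for v2, _ in related.get(v, []):
                pool.setdefault(v2, s)
    # rerank with a linear combination of signals (the weights are invented)
    def rank(v):
        return (0.5 * seed_importance[pool[v]] +
                0.3 * popularity[v] +
                0.2 * freshness[v])
    return sorted(pool, key=rank, reverse=True)
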

Precomputed recommendations are cached and served up a few at a time to a user each time they visit the site.  Each recommendation is easily associated with an explanation based on its seed video: "recommended because you favorited abc".  While this system isn't going to win any best paper prizes it is certainly effective: 60% of all video clicks from the YouTube homepage are for recommendations.

RecSys 2010: social recommendations

Mohsen Jamali won the best paper award at RecSys for his presentation on SocialMF, a model-based recommender designed to improve ratings-based recommendations for users who have made few ratings but who have friends, or friends of friends, who have provided plenty.  Mohsen's previous solution to the same problem was TrustWalker.  TrustWalker predicts ratings from a set of item-item similarities and a social graph, where nodes are users and edges represent trust or friendship relationships.  The rating for item i for user u is predicted by taking a short random walk on the graph, stopping at some friend-of-a-friend v and returning v's rating for item j, where j is the most similar item to i which v has rated.  Closed form expressions for these predictions don't scale at all well, so to make a prediction TrustWalker actually executes the random walk a few times and returns the average rating.  On a test set of ratings by 50k users for 100k product reviews from the Epinions website, TrustWalker does indeed show a significant benefit in both coverage and prediction accuracy for cold start users over baseline methods that don't leverage the social graph.
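
A toy version of the TrustWalker prediction loop might look like the sketch below.  The names are mine, and the fixed stopping probability is a simplification of the paper's more careful stopping rule:

import random

def trustwalker_predict(u, i, ratings, sims, friends,
                        stop_prob=0.15, walks=100):
    # ratings: user -> {item: rating}; sims: (i, j) -> item similarity;
    # friends: user -> list of trusted users
    samples = []
    for _ in range(walks):
        v = u
        while True:
            rated = ratings.get(v, {})
            if i in rated:                       # direct rating: use it
                samples.append(rated[i])
                break
            if rated and random.random() < stop_prob:
                # stop here and use v's rating for the item most similar to i
                j = max(rated, key=lambda item: sims.get((i, item), 0.0))
                samples.append(rated[j])
                break
            nbrs = friends.get(v, [])
            if not nbrs:
                break                            # dead end, abandon this walk
            v = random.choice(nbrs)
    return sum(samples) / float(len(samples)) if samples else None
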


SocialMF is a Bayesian model-based solution to the same problem: latent factors for all users and items are learned jointly from ratings and the social graph.  Ratings for cold start users can then be predicted from their learned factors.  When tested on the Epinions dataset, and on a new dataset of ratings by 1M users for 50k movies crawled from Flixster, SocialMF again improves the accuracy of predicted ratings for cold start users over latent factor models that don't take the social graph into account.

The model-based approach is elegant and perhaps even scalable: learning time is linear in the number of users, and the paper reports a runtime of 5.5 hours for 1M users.  But it lacks the powerful explanations of the simpler system: "recommended for you because your friend xyz likes it".

Thursday, 30 September 2010

RecSys 2010: industry

One striking feature of this year's ACM Recommender Systems conference was a large industry presence: from SAP, Bloomberg, IBM and Sony, to Yahoo! and Google, as well as a host of smaller outfits. I particularly enjoyed comparing notes with the teams from YouTube, Microsoft Bing, and the b2b recommendation company MediaUnbound (now Rovi), which has provided music recommendation services for MTV and Ericsson. Some clear themes emerged from those conversations, as well as the various panels in which we were speaking.

Some researchers in the field might be surprised to hear that the predominant industry view is that predicting ratings for items isn't a very useful way to build or evaluate a recommender system. This is partly because "users don't get ratings", as Palash Nandy from Google put it at the opening panel, "they always rate 1 or 5", and partly because improving the prediction of users' ratings is at best weakly aligned with the business aims of most systems. In fact all of the industry people I spoke to were suspicious of offline evaluation metrics for recommendations in general. The YouTube team had even compared standard offline metrics for two recommendation algorithms against the results of controlled experiments on their live system. They found no correlation between the two: the offline metrics were worthless for predicting which algorithm would do better in practice. Finally there was a strong industry concern that the design of the Netflix Prize competition had led many researchers into developing methods that could not possibly scale to real world scenarios.  Most industry people I spoke to described scalability as their number one consideration.

There were two strong beliefs that I heard stated many times in the last few days: improvements in the overall user experience dominate likely enhancements of current algorithms; and recommendations from friends remain far more powerful than those from machines.  So the sort of research that will excite industry in the near future is likely to focus on topics such as:
  • providing more compelling explanations for machine recommendations
  • exploiting social relationships to make better recommendations
  • optimising the user's interaction with particular recommenders
Where does this leave the academic researcher with good ideas but no access to real users? This question was raised eloquently by Joe Konstan from the pioneering GroupLens team at the University of Minnesota, in a panel debating the value of competitions for scientific research. Joe's plea was for someone to offer researchers not a big cash prize or even a huge dataset, but rather an experimental platform where their systems can be deployed on a modest group of users under controlled conditions, the uptake of their recommendations recorded, and users in the test group asked directly to give feedback. This approach would meet most of industry's concerns about academic research head on.  But it raises obvious issues of privacy, security, maintenance, and trust between businesses and their customers - probably enough to scare away even the most research-friendly commercial organisation.

One final thing that might interest researchers with concerns about any of these issues: pretty much everyone said that they were hiring.

Tuesday, 23 February 2010

Adsorption: scalable graph-based everything

Graph-based machine learning algorithms have a reputation for scaling badly, so I've enjoyed reading two papers describing a simple graph algorithm that does scale easily and which can be applied to a large range of problems.  Video suggestion and discovery for YouTube: taking random walks through the view graph by Shumeet Baluja and seven colleagues from Google, and Ranking and semi-supervised classification on large scale graphs using map-reduce by Delip Rao and David Yarowsky of Johns Hopkins University, both show how Label Propagation has a simple implementation in the map-reduce framework which is closely related to PageRank.

Imagine a graph in which you have labels only for some nodes, but where the edges and their associated weights show the connections between related nodes and their relative strengths.  Label Propagation allows you to infer labels for the remaining unlabelled nodes, or additional labels for those already labelled.  The Google paper considers a graph in which nodes represent videos and users, and edges between them represent views on YouTube.  User nodes are labelled with the videos they have viewed.  We can generate recommendations for a user by propagating additional labels from other users connected to them by many short paths i.e. users who have viewed some of the same videos.

The basic algorithm is so simple that, using the elegant dumbo python wrapper for Hadoop, we can write the whole thing in a few lines of code.

Let's suppose we have a log of fruit consumption here at Last.fm HQ:

norman  orange  1
norman  orange  1
mark    apple   1
klaas   orange  1
mark    banana  1
mark    apple   1
mark    apple   1
norman  banana  1
klaas   pear    1
ricky   banana  1
olivier cherry  1
norman  orange  1
klaas   cherry  1
olivier banana  1



First of all let's create a graph from this logfile.  We map the log entries:

# Node ids are tagged tuples like (USER, 'norman').  The tag values are
# inferred from the output at the end of the post; the last two constants
# are assumptions.
USER, ITEM, DUMMY_USER = 1, 2, 3
INJECTION_WEIGHT = 1.0        # weight of the edge clamping a dummy node to its user
MAX_LABELS_PER_NODE = 5       # prune label distributions to this many entries

def parse_log(value):
    user,item,count = value.split("\t")
    return user,item,int(count)

def map_log(key,value):
    user,item,count = parse_log(value)
    yield ((USER,user),(ITEM,item)),count
    yield ((ITEM,item),(USER,user)),count
 
  
and sum the counts with a dumbo.sumreducer to create the edges.  Next we map the edges

def map_edges(key,value):
    yield key[0],(key[1],value)
 
  
In the reducer we output adjacency lists, also adding a shadow or dummy node for each user, to which the observed label distributions are clamped.  We use the dumbo getpath decorator to output the adjacency lists and label distributions into different directories:

@opt("getpath", "yes")
def reduce_edges(key,values):
    values = list(values)
    yield ("adjacency_lists",key),values
    if key[0] == USER:
        yield ("adjacency_lists",(DUMMY_USER,key[1])),[(key,INJECTION_WEIGHT)]
        dist = normalise_dist(values,MAX_LABELS_PER_NODE)
        yield ("label_dists",(DUMMY_USER,key[1])),label_dist
     
      
Here's the function we use to normalise and prune label distributions, and the dumbo runner for this job:

from operator import itemgetter

def normalise_dist(dist,max_labels):
    # keep the heaviest labels and renormalise so the weights sum to 1
    dist = sorted(dist,key=itemgetter(1),reverse=True)[:max_labels]
    norm = float(sum(weight for label,weight in dist))
    return [(label,weight/norm) for label,weight in dist]

def runner(job):
    job.additer(map_log,sumreducer)
    job.additer(map_edges,reduce_edges)



Now we're ready to propagate the label distributions.  We do this iteratively, on each iteration sending the distribution at each node to each of its neighbours.  First we use some dumbo magic to join distributions to adjacency lists.  We just need to write a reducer:

class Reducer(JoinReducer):
    def primary(self,key,values):
        self.label_dist = values.next()
    def secondary(self,key,values):
        yield key,(self.label_dist,values.next())

      
Then we transmit the distribution from each node to its neighbours:

def map_propagate(key,value):
    label_dist,adjacency_list = value
    for node,weight in adjacency_list:
        yield node, [(label,prob*weight) for label,prob in label_dist]


Finally we sum and normalise the incoming distributions at each receiving node:

from collections import defaultdict

def reduce_sum_normalise(key,values):
    # sum the weighted distributions arriving at this node, then renormalise
    dist = defaultdict(float)
    for d in values:
        for label,prob in d:
            dist[label] += float(prob)
    dist = normalise_dist(dist.items(),MAX_LABELS_PER_NODE)
    yield key,dist


Here's the runner for this job:

def runner(job):
    multimapper = MultiMapper()
    multimapper.add("label_dists", primary(identitymapper))
    multimapper.add("adjacency_lists", secondary(identitymapper))
    job.additer(multimapper,Reducer)
    job.additer(map_propagate,reduce_sum_normalise)


And that's it: a massively scalable graph-based recommender in under 50 lines of python.  Well ok, we still need to interpret the output.  We started with an observed distribution at the dummy node for Norman like this:

(3, 'norman')   [((2, 'orange'), 0.75), ((2, 'banana'), 0.25)]


and after a few iterations our recommender has inferred this at his node:

(1, 'norman')   [((2, 'orange'), 0.45735654387297453), ((2, 'banana'), 0.28646536729600702), ((2, 'cherry'), 0.11646859085648147), ((2, 'apple'), 0.074826725260416671), ((2, 'pear'), 0.064882772714120379)]


So next lunchtime I'll suggest he tries some cherries.

Thursday, 18 February 2010

Integrating social relationships into a music recommender

It's always nice to see people working with Last.fm's huge quantity of publicly available data, especially when they include several researchers from IBM's famous research labs. In their ACM Recommender Systems 2009 paper Augmenting collaborative recommender by fusing explicit social relationships, Q. Yuan, S. Zhao, L. Chen, Y. Liu, S. Ding, X. Zhang and W. Zheng report some experiments incorporating Last.fm's friend and group membership relationships into more conventional Collaborative Filtering systems.

Their first recommender is a user-based CF system, i.e. the score given to a new recommendation of Muse for user Jim is a weighted average over the scores (ratings, scrobbles, etc.) of users who already listen to Muse, where the weights are similarities computed between Jim and each of Muse's fans.

Yuan et al. start with a baseline CF recommender in which the similarity between users is the cosine similarity between their vectors of artist scrobbles. For reasons that aren't obvious to me, the scrobble counts are first mapped to binary values, e.g. 1 if a user has scrobbled Muse three or more times, 0 otherwise. In their first experimental recommender, these similarities are augmented by a weighted combination of similarities based on friend and group membership relationships. The relationship similarities are also computed as cosine similarities between binary vectors, so the friendship similarity between two users expresses the extent to which they have friends in common, and likewise for group membership.
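
A minimal sketch of that fusion, with illustrative weights rather than the paper's fitted ones; scrobbles, friends and groups each map a user to a binary numpy vector:

import numpy as np

def cosine(a, b):
    # cosine similarity between two binary vectors
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return a.dot(b) / denom if denom else 0.0

def fused_similarity(u, v, scrobbles, friends, groups, w=(0.6, 0.2, 0.2)):
    # weighted combination of listening, friendship and group similarities
    return (w[0] * cosine(scrobbles[u], scrobbles[v]) +
            w[1] * cosine(friends[u], friends[v]) +
            w[2] * cosine(groups[u], groups[v]))
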

The second experimental system is a graph-based recommender, built on a graph of user, artist and group nodes. The graph starts with the obvious edges between users and the artists they listen to, and the groups they belong to. Then edges are added between pairs of users whose relationship similarities are more than some threshold, so each user is connected to other users with similar friends or belonging to similar groups as well as to their 'real' friends. Like the binary CF vectors in the first system, the edge weights are all set to 1. Finally user-user similarity scores are extracted from the graph using a Laplacian matrix pseudoinverse algorithm: in practice the computation required is similar to inverting a matrix containing concatenated artist, friend and group vectors for each user.
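
For intuition, here's a tiny dense version of that last step.  The pseudoinverse of the graph Laplacian acts as a similarity kernel whose entries relate to commute times between nodes, though a dense pinv obviously won't scale beyond toy graphs:

import numpy as np

def laplacian_kernel(adjacency):
    # adjacency: symmetric 0/1 matrix over user, artist and group nodes;
    # entry [a, b] of the result scores how well connected a and b are
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    return np.linalg.pinv(laplacian)
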

After these nice simple suggestions for combining social relationships and CF, the experiments themselves are a bit of a disappointment, mainly because the dataset they use contains only 1000 artists and users. The evaluation metrics are precision, recall and f-value over the top few recommendations, where a recommendation is considered correct if the user actually scrobbled the recommended artist at least three times. In the CF recommender, using weights that seem to be rather highly fitted to the small dataset, there is a 2-3% improvement in f-value over the top five recommendations. The graph recommender shows an 8% improvement, but they don't give baseline results from a graph not including the social relationships, so it's not clear whether this is due to the extra social information or just the different algorithm.

In fact using such a small dataset raises all sorts of problems in interpreting the results:
  • it's hard to tell if absolute values are good or bad
  • is it sensible to evaluate the top-5 recommendations from a pool of only 1000 artists?
  • are the improvements significant?
  • are the various parameters overfitted?
  • does the graph method scale?
It makes me suspect that IBM are just playing in the recommender space, while others are clearly serious about it.