Tuesday, 23 February 2010

Adsorption: scalable graph-based everything

Graph-based machine learning algorithms have a reputation for scaling badly, so I've enjoyed reading two papers describing a simple graph algorithm that does scale easily and which can be applied to a large range of problems.  Video suggestion and discovery for YouTube: taking random walks through the view graph by Shumeet Baluja and seven colleagues from Google, and Ranking and semi-supervised classification on large scale graphs using map-reduce by Delip Rao and David Yarowsky of Johns Hopkins University, both show how Label Propagation has a simple implementation in the map-reduce framework which is closely related to PageRank.

Imagine a graph in which you have labels only for some nodes, but where the edges and their associated weights show the connections between related nodes and their relative strengths.  Label Propagation allows you to infer labels for the remaining unlabelled nodes, or additional labels for those already labelled.  The Google paper considers a graph in which nodes represent videos and users, and edges between them represent views on YouTube.  User nodes are labelled with the videos they have viewed.  We can generate recommendations for a user by propagating additional labels from other users connected to them by many short paths i.e. users who have viewed some of the same videos.

The basic algorithm is so simple that, using the elegant dumbo python wrapper for Hadoop, we can write the whole thing in a few lines of code.

Let's suppose we have a log of fruit consumption here at Last.fm HQ:

norman  orange  1
norman  orange  1
mark    apple   1
klaas   orange  1
mark    banana  1
mark    apple   1
mark    apple   1
norman  banana  1
klaas   pear    1
ricky   banana  1
olivier cherry  1
norman  orange  1
klaas   cherry  1
olivier banana  1

First of all let's create a graph from this logfile.  We map the log entries:

def parse_log(value):
    user,item,count = value.split("\t")
    return user,item,int(count)
def map_log(key,value):
    user,item,count = parse_log(value)
    yield ((USER,user),(ITEM,item)),count
    yield ((ITEM,item),(USER,user)),count
and sum the counts with a dumbo.sumreducer to create the edges.  Next we map the edges

def map_edges(key,value):
    yield key[0],(key[1],value)
In the reducer we output adjacency lists, also adding a shadow or dummy node for each user, to which the observed label distributions are clamped.  We use the dumbo getpath decorator to output the adjacency lists and label distributions into different directories:

@opt("getpath", "yes")
def reduce_edges(key,values):
    values = list(values)
    yield ("adjacency_lists",key),values
    if key[0] == USER:
        yield ("adjacency_lists",(DUMMY_USER,key[1])),[(key,INJECTION_WEIGHT)]
        dist = normalise_dist(values,MAX_LABELS_PER_NODE)
        yield ("label_dists",(DUMMY_USER,key[1])),label_dist
Here's the function we use to normalise and prune label distributions, and the dumbo runner for this job:

def normalise_dist(dist,max_labels):
    dist = sorted(dist,key=itemgetter(1),reverse=True)[:max_labels]
    norm = sum(weight for label,weight in dist)
    return [(label,weight/norm) for label,weight in dist]

def runner(job):

Now we're ready to propagate the label distributions.  We do this iteratively, on each iteration sending the distribution at each node to each of its neighbours.  First we use some dumbo magic to join distributions to adjacency lists.  We just need to write a reducer:

class Reducer(JoinReducer):
    def primary(self,key,values):
        self.label_dist = values.next()
    def secondary(self,key,values):
        yield key,(self.label_dist,values.next())

Then we transmit the distribution from each node to its neighbours:

def map_propagate(key,value):
    label_dist,adjacency_list = value
    for node,weight in adjacency_list:
        yield node, [(label,prob*weight) for label,prob in label_dist]

Finally we sum and normalise the incoming distributions at each receiving node:

def reduce_sum_normalise(key,values):
    dist = defaultdict(lambda:0)
    for d in values:
        for label,prob in d:
            dist[label] += float(prob)
    dist = normalise_dist(dist.items(),MAX_LABELS_PER_NODE)
    yield key,dist

Here's the runner for this job:

def runner(job):
    multimapper = MultiMapper()
    multimapper.add("label_dists", primary(identitymapper))
    multimapper.add("adjacency_lists", secondary(identitymapper))

And that's it: a massively scalable graph-based recommender in under 50 lines of python.  Well ok, we still need to interpret the ouput.  We started with an observed distribution at the dummy node for Norman like this:

(3, 'norman')   [((2, 'orange'), 0.75), ((2, 'banana'), 0.25)]

and after a few iterations our recommender has inferred this at his node:

(1, 'norman')   [((2, 'orange'), 0.45735654387297453), ((2, 'banana'), 0.28646536729600702), ((2, 'cherry'), 0.11646859085648147), ((2, 'apple'), 0.074826725260416671), ((2, 'pear'), 0.064882772714120379)]

So next lunchtime I'll suggest he tries some cherries.

Thursday, 18 February 2010

Integrating social relationships into a music recommender

It's always nice to see people working with Last.fm's huge quantity of publicly available data, especially when they include several researchers from IBM's famous research labs. In their ACM Recommender Systems 2009 paper Augmenting collaborative recommender by fusing explicit social relationships, Q. Yuan, S. Zhao, L. Chen, Y. Liu, S. Ding, X. Zhang and W. Zheng report some experiments incorporating Last.fm's friend and group membership relationships into more conventional Collaborative Filtering systems.

Their first recommender is a user-based CF system i.e. the score given to a new recommendation of Muse for user Jim is a weighted average over all the scores (ratings, scrobbles, etc.) for users who already listen to Muse, and the weights are similarities computed between Jim and each of Muse's fans.

Yuan et al. start with a baseline CF recommender in which the similarity between users is based on cosine distance between their vectors of artist scrobbles. For reasons that aren't obvious to me, the scrobble counts are first mapped to binary values, e.g. 1 if a user has scrobbled Muse three or more times, 0 otherwise. In their first experimental recommender, these similarities are augmented by a weighted combination of similarities based on friend and group membership relationships. The relationship similarities are also computed as cosine distances between binary vectors, so the friendship similarity between two users expresses the extent to which they have friends in common, and likewise for group membership.

The second experimental system is a graph-based recommender, built on a graph of user, artist and group nodes. The graph starts with the obvious edges between users and the artists they listen to, and the groups they belong to. Then edges are added between pairs of users whose relationship similarities are more than some threshold, so each user is connected to other users with similar friends or belonging to similar groups as well as to their 'real' friends. Like the binary CF vectors in the first system, the edge weights are all set to 1. Finally user-user similarity scores are extracted from the graph using a Laplacian matrix pseudoinverse algorithm: in practice the computation required is similar to inverting a matrix containing concatenated artist, friend and group vectors for each user.

After these nice simple suggestions for combining social relationships and CF, the experiments themselves are a bit of a disappointment, mainly because the dataset they use contains only 1000 artists and users. The evaluation metrics are precision, recall and f-value over the top few recommendations, where a recommendation is considered correct if the user actually scrobbled the reommended artist at least three times. In the CF recommender, using weights that seem to be rather highly fitted to the small dataset, there is a 2-3% improvement in f-value over the top five recommendations. The graph recommender shows an 8% improvement, but they don't give baseline results from a graph not including the social relationships, so it's not clear whether this is due to the extra social information or just the different algorithm.

In fact using such a small dataset raises all sorts of problems in interpreting the results:
  • it's hard to tell if absolute values are good or bad
  • is it sensible to evaluate the top-5 recommendations from a pool of only 1000 artists?
  • are the improvements significant?
  • are the various parameters overfitted?
  • does the graph method scale?
It makes me suspect that IBM are just playing in the recommender space, while others are clearly serious about it.