Tuesday, 5 October 2010

RecSys 2010: music

How do teenagers learn about new music?  Audrey LaPlante has spent some time actually asking them, and presented some of their answers at the Workshop on Music Recommendation and Discovery.

Some of her findings were extremely interesting.  All of the teenagers she spoke to said that their tastes had changed substantially over time, and that the changes were due to changes in their social network.  Most had music geek friends whom they actively consulted for music recommendations, even though they were not influential people in other respects.  Although close contacts were most likely to be sources of new music, those chosen to play that role were "almost always those whose social network were more different from theirs, mostly those who were going to a different school".

I'll be interested to follow Audrey's research and see how we can learn to make online social networks an equally great place for young people to discover new music.

RecSys 2010: YouTube

The YouTube team had a poster at RecSys descibing their recommender in some detail.  The design is intentionally simple, and apparently entirely implemented as a series of MapReduce jobs.

They first compute fixed-length lists of related videos, based in principle on simple co-occurrence counts in a short period i.e. given a video i, the count for a candidate related video j could be the number of users who viewed both i and j within the last 24 hours.  The counts are normalised to take into account the relative popularity of different videos, and no doubt massaged in numerous other ways to remove noise and bias.  As the paper says, "this is a simplified description".

At recommendation time they build a set of seed videos representing the query user, based on the user's views, favourites, playlists, etc.  They then assemble a candidate pool containing the related videos for all of the seeds.  If the pool is too small, they expand it by adding the related videos of all the videos already in the pool, though always keeping track of the original seed video for messaging purposes.  The candidates in the pool are reranked based on a linear combination of values expressing the popularity of a given candidate video, the importance of its seed to the user, the overall popularity of the candidate and its freshness.  Finally the recommended videos are diversified, using simple constraints on the number of recommended videos that can be associated with any one seed, or have been uploaded by any one user.  Diversification is particularly important as related videos are typically very tightly associated with their seed.

Precomputed recommendations are cached and served up a few at a time to a user each time they visit the site.  Each recommendation is easily associated with an explanation based on its seed video: "recommended because you favorited abc".  While this system isn't going to win any best paper prizes it is certainly effective: 60% of all video clicks from the YouTube homepage are for recommendations.

RecSys 2010: social recommendations

Mohsen Jamali won the best paper award at RecSys for his presentation on SocialMF, a model-based recommender designed to improve ratings-based recommendations for users who have made few ratings but who have friends, or friends of friends, who have provided plenty.  Mohsen's previous solution to the same problem was TrustWalker.  TrustWalker predicts ratings from a set of item-item similarities and a social graph, where nodes are users and edges represent trust or friendship relationships.  The rating for item i for user u is predicted by taking a short random walk on the graph, stopping at some friend-of-a-friend v and returning v's rating for item j, where j is the most similar item to i which v has rated.  Closed form expressions for these predictions don't scale at all well, so to make a prediction TrustWalker actually executes the random walk a few times and returns the average rating.  On a test set of ratings by 50k users for 100k product reviews from the Epinions website, TrustWalker does indeed show a significant benefit in both coverage and prediction accuracy for cold start users over baseline methods that don't leverage the social graph.

SocialMF is a Bayesian model-based solution to the same problem: latent factors for all users and items are learned jointly from ratings and the social graph.  Ratings for cold start users can then be predicted from their learned factors.  When tested on the epinions dataset, and a new one of ratings by 1M users for 50k movies crawled from Flixster, SocialMF again does indeed improve the accuracy of predicted ratings for cold start users over latent factor models that don't take the social graph into account.

The model-based approach is elegant and perhaps even scalable: learning time is linear in the number of users, and the paper reports a runtime of 5.5 hours for 1M users.  But it lacks the powerful explanations of the simpler system: "recommended for you because your friend xyz likes it".

Thursday, 30 September 2010

RecSys 2010: industry

One striking feature of this year's ACM Recommender Systems conference was a large industry presence: from SAP, Bloomberg, IBM and Sony, to Yahoo! and Google, as well as a host of smaller outfits. I particularly enjoyed comparing notes with the teams from YouTube, Microsoft Bing, and the b2b recommendation company MediaUnbound (now Rovi), which has provided music recommendation services for MTV and Ericsson. Some clear themes emerged from those conversations, as well as the various panels in which we were speaking.

Some researchers in the field might be surprised to hear that the predominant industry view is that predicting ratings for items isn't a very useful way to build or evaluate a recommender system. This is partly because "users don't get ratings", as Palash Nandy from Google put it at the opening panel, "they always rate 1 or 5", and partly because improving prediction of users' ratings is at best weakly aligned with the business aims of most systems. In fact all of the industry people I spoke to were suspicious of offline evaluation metrics for recommendations in general. The YouTube team had even compared some standard offline metrics evaluating two recommendation algorithms with results from online controlled experiments on their live system. They found no correlation between the two: this means that the offline metrics had no value whatsoever in telling which algorithm would do better in practice. Finally there was a strong industry concern that the design of the Netflix Prize competition had led many researchers into developing methods that could not possibly scale to real world scenarios.  Most industry people I spoke to described scalability as their number one consideration.

There were two strong beliefs that I heard stated many times in the last few days: improvements in the overall user experience dominate likely enhancements of current algorithms; and recommendations from friends remain far more powerful than those from machines.  So the sort of research that will excite industry in the near future is likely to focus on topics such as:
  • providing more compelling explanations for machine recommendations
  • exploiting social relationships to make better recommendations
  • optimising the user's interaction with particular recommenders
Where does this leave the academic researcher with good ideas but no access to real users? This question was raised eloquently by Jo Konstan from the pioneering GroupLens team at the University of Minnesota, in a panel debating the value of competitions for scientific research. Jo's plea was for someone to offer researchers not a big cash prize or even a huge dataset, but rather an experimental platform where their systems can be deployed on a modest group of users under controlled conditions, the uptake of their recommendations recorded, and users in the test group asked directly to give feedback. This approach would meet most of industry's concerns about academic research head on.  But it raises obvious issues of privacy, security, maintenance, and trust between businesses and their customers - probably enough to scare away even the most research-friendly commercial organisation.

One final thing that might interest researchers with concerns about any of these issues: pretty much everyone said that they were hiring.

Monday, 16 August 2010

ISMIR 2010: code

The Best Paper award went to Ron Weiss and Juan Bello of NYU for their work on finding repeating patterns in musical audio with Probabilistic Latent Component Analysis (aka Convolutive Non-Negative Matrix Factorisation) applied to chroma features. This finds a best fit decomposition of the features considered as a linear mixture of some number of latent components convolved with corresponding activation weights that vary over time during the course of the track. Weiss and Bello apply priors to enforce sparsity in the activations, so that only a small number of components are active at any given time. Better still this means that you can start with a large number of latent components, and the activations for most of them drop to zero throughout the track as the learning algorithm proceeds, meaning that the effective number of underlying components can be learned rather than having to be specified in advance. Finally you can estimate the high-level repeat structure of the track by picking a simple viterbi path through the activations. I thought the paper was well worth the prize: an original approach, convincing results, clear presentation at the conference, and, best of all, published python code at http://ronw.github.com/siplca-segmentation.

Ron Weiss also had a hand in the appealing Gordon Music Collection Database Management System available from http://bitbucket.org/ronw/gordon. This looks like a great lightweight framework for managing experiments based on audio feature extraction, with a python api, support for sql-like queries, automatic feature caching, and a clean web interface which includes automatic best-effort visualisations of the features for each track. I'm really looking forward to trying it out. The name Gordon apparently refers to the character in Nick Hornby's novel High Fidelity.

Opinion is increasingly divided about the pros and cons of using Flash on your website. If you still love Flash, then the Audio Processing Library for Flash, available from http://code.google.com/p/alf, now lets you do audio feature extraction in realtime, directly from your Flash movies. The developers even have a suprisingly funky example game, in which the graphics, and even some of the gameplay, are based directly on features extracted from the soundtrack as it plays: presumably this is one game which MIR researchers will always win!

Thursday, 12 August 2010

ISMIR 2010: recommendation

Brian McFee presented a poster on Learning Similarity from Collaborative Filters describing a method of addressing the cold start problem. The idea is to learn to rank tag representations of artists or tracks, so that the ranks agree with similarity scores given by collaborative filtering, using Metric Learning to Rank, an extension of the ranking SVM. This shows a modest improvement when applied to autotags inferred from audio, but really good results when applied to musicological tags scraped from Pandora. Conclusion: if you have musicological tags then you may not need CF data to get really good artist similarities.

Although at the moment musicological tags are probably even harder to come by than CF data, there are encouraging signs of progress in autotagging. A nice paper by James Bergstra, Mike Mandel and Doug Eck on Scalable Genre and Tag Prediction with Spectral Covariance shows impressive results with a scalable framework that learns from simple spectral features.

Along with several submissions to the MIREX evaluation tasks, this paper also illustrates another small trend at ISMIR 2010, which is a move away from known flawed features, in this case MFCCs used to represent musical timbre.

ISMIR 2010: crowdsourcing

Amazon Mechanical Turk seems to be gaining some ground as a way of collecting experimental groundtruth. A workshop on Creating Speech and Language Data with Amazon Mechanical Turk recently challenged researchers to come up with useful datasets for an outlay of $100, resulting in over 20 new datasets. A review paper by Chris Callison-Burch and Mark Dredze summarises some of the lessons learned.

Now Jin Ha Lee of the University of Washington has compared crowdsourced music similarity judgements from Mechanical Turk with those provided by experts for the carefully controlled MIREX Audio Music Similarity task. It took only 12 hours to crowdsource judgements compared to two weeks to gather them from experts, and, provided that the tasks were carefully designed, the results were almost identical: only 6 out of 105 pairwise comparisons between algorithms would have been reversed by using the crowdsourced judgements.

In another presentation Mike Mandel showed how social tags for short audio clips sourced from Mechanical Turk could be used for autotagging. He uses a Conditional restricted Boltzmann machine to model the cooccurrence of tags between different clips, with inputs representing the user who applied each tag, as well as the track, album and artist from which each clip was drawn. The learned weights are used to create a smoothed tag cloud from the tags provided for a clip by several users. When used as input to his autotagging classifiers, these smoothed tag clouds gave better results than simply aggregating tags across users, approaching the performance of classifiers trained on tags from the much more controlled MajorMiner game.

Wednesday, 11 August 2010

ISMIR 2010: applications

A team from Waseda University and KDDI labs are demoing a system that generates slideshows of photos from flickr from song lyrics. The images are chosen automatically for each line of the lyrics, relying in part on assigning a "general impression" for each line chosen from a small dictionary of concepts such as season of the year, weather and time of day. I'm not sure how generally applicable this system is, but the example slideshow I saw was suprisingly poetic, like the serendipitously beautiful sequences you sometimes see on the Last.fm visual radio player (which only uses artist images).

Scott Miller presented an action-packed poster on a GeoShuffle iphone app which, amongst other things, recommends music based on geo location information. In a nice experiment GeoShuffle was given to a number of users for three weeks. During that time it created playlists for them, randomly switching between several different playlist generation methods every hour or so. Meanwhile their skip rate was recorded, to see which playlist generation algorithm they liked the best. A method that chose similar songs to those listened to while travelling on the same path was by far the best. Conclusion: humans are creatures of habit!

ISMIR 2010: music transcription

The Music Information Retrieval community is a pretty broad church, but the very much unsolved problem of transcribing a full score from audio input remains at its heart. Nowadays this is approached via various somewhat simpler subtasks, such as capturing just the harmony of the music as a series of chord labels. Most existing approaches to this use the chroma feature as the main input to their labelling algorithm. Chroma purports to show the amount of energy associated with each pitch class (note of the scale) in a given short time period, but it's well known to be highly flawed: if you resynthesize the pitches implied by the chroma it usually sounds nothing like the original audio.

A convincing presentation by Matthias Mauch, currently working in Masataka Goto's AIST lab in Japan, showed how you can improve on existing chord labelling performance by using the output of a multipitch transcription algorithm instead chroma. Mauch's labeller is fairly sophisticated (a Dynamic Bayes Net with a number of inputs as well as pitch evidence), but his pitch transcription algorithm is a very simple best-fit effort: expect a flurry of papers seeing if fancy state of the art methods can work even better than the current 80% accuracy on the MIREX test set of songs by The Beatles and others.

Two posters dealt with the misleadingly-named "octave error" problem in estimating the tempo (bpm) of music from audio input: state of the art beat trackers are good at finding regular pulses, but they often have trouble telling the beat from the half-beat. I really liked Jason Hockman's approach to solving this. Instead of making any changes to his beat tracker, he simply trains a classifier to predict whether a given track is slow or fast, using training examples which Last.fm users have tagged as slow or fast. Despite using just a single aggregate feature vector for each track, which doesn't obviously contain any tempo information, his slow-fast classifier works really well, with an accuracy of over 95%. Presumably slow songs simply sound more like other slow songs than fast ones. I'll be interested to see how much this can help in tempo estimation. I'd expect that the classifier wouldn't work so well with songs that aren't obviously slow or fast enough to have been tagged as either, but I'd also guess that moderate tempo songs are less affected by octave error in the first place.

Tuesday, 6 April 2010

Graph processing for really big data

MapReduce implementations of graph algorithms like PageRank and adsorption scale to millions of nodes on a cluster of around 50 machines, but if you want to process billions (or even tens of millions, depending on your algorithm) then you need a different framework.  Google uses Pregel, about which they've said little except that it was inspired by the Bulk Synchronous Parallel model for parallel programming.

So the announcement of a BSP package for Hadoop in the Apache HAMA project could be an interesting one to watch.  There's even a BSP hello world, although getting further may be hard work with the current level of documentation.

MapReduce algorithm design

Data-Intensive Text Processing with MapReduce is a new book by Jimmy Lin and Chris Dyer of the University of Maryland.  It's due for publication later this year but a full draft is already available as a pdf.  The book shows how you can implement a variety of useful algorithms on a MapReduce cluster, including graph algorithms such as breadth first search and PageRank, and parameter estimation for latent variable models, with a detailed explanation of how to do this for Hidden Markov Models, and even a sketch for Conditional Random Fields.  Although the book is a practical manual, the algorithms are given in simple pseudo-code rather than Java classes intended for use on a Hadoop cluster.  This has huge advantages for readability, and makes it much easier for the authors to draw out some generally applicable design patterns for MapReduce algorithms.

The order inversion pattern is a nice trick that lets a reducer see intermediate results before it processes the data that generated them.  Lin and Dyer illustrate this with the example of computing relative frequencies for co-occurring word pairs e.g. what are the relative frequencies of words occurring within a small window of the word "dog"?  The mapper counts word pairs in the corpus, so its output looks like
((dog, cat), 125)
((dog, foot), 246)
But it also keeps a running total of all the word pairs containing "dog", outputting this as
((dog,*), 5348)
Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*), 5348) first, followed by all the other counts, and can trivially store the total and then output relative frequencies.  The benefit of the pattern is that it avoids an extra MapReduce iteration without creating any additional scalability bottleneck.

Other patterns explained in the book include the pairs and stripes approaches to produce large sparse matrix mapper output, in-mapper combining to limit the amount of mapper output written to disk (a common scalability bottleneck in MapReduce), and value-to-key conversion for relational joins of large datasets.

All in all, this book is a great complement to Tom White's Hadoop book.  An extra plus is that the pseudo code can be translated with virtually no effort into dumbo code for Hadoop.  I can see Data-Intensive Text Processing and dumbo becoming a standard way of teaching MapReduce quickly and excitingly in the classroom.

Tuesday, 30 March 2010

Last.fm are hiring!

If you're interested in the problems discussed here, and want to get paid for researching, designing, implementing and maintaining solutions to them, we're hiring.  See this post for a list of more great reasons to work at Last.fm.  We also sometimes take interns and like to collaborate with academic researchers.

Tuesday, 23 February 2010

Adsorption: scalable graph-based everything

Graph-based machine learning algorithms have a reputation for scaling badly, so I've enjoyed reading two papers describing a simple graph algorithm that does scale easily and which can be applied to a large range of problems.  Video suggestion and discovery for YouTube: taking random walks through the view graph by Shumeet Baluja and seven colleagues from Google, and Ranking and semi-supervised classification on large scale graphs using map-reduce by Delip Rao and David Yarowsky of Johns Hopkins University, both show how Label Propagation has a simple implementation in the map-reduce framework which is closely related to PageRank.

Imagine a graph in which you have labels only for some nodes, but where the edges and their associated weights show the connections between related nodes and their relative strengths.  Label Propagation allows you to infer labels for the remaining unlabelled nodes, or additional labels for those already labelled.  The Google paper considers a graph in which nodes represent videos and users, and edges between them represent views on YouTube.  User nodes are labelled with the videos they have viewed.  We can generate recommendations for a user by propagating additional labels from other users connected to them by many short paths i.e. users who have viewed some of the same videos.

The basic algorithm is so simple that, using the elegant dumbo python wrapper for Hadoop, we can write the whole thing in a few lines of code.

Let's suppose we have a log of fruit consumption here at Last.fm HQ:

norman  orange  1
norman  orange  1
mark    apple   1
klaas   orange  1
mark    banana  1
mark    apple   1
mark    apple   1
norman  banana  1
klaas   pear    1
ricky   banana  1
olivier cherry  1
norman  orange  1
klaas   cherry  1
olivier banana  1

First of all let's create a graph from this logfile.  We map the log entries:

def parse_log(value):
    user,item,count = value.split("\t")
    return user,item,int(count)
def map_log(key,value):
    user,item,count = parse_log(value)
    yield ((USER,user),(ITEM,item)),count
    yield ((ITEM,item),(USER,user)),count
and sum the counts with a dumbo.sumreducer to create the edges.  Next we map the edges

def map_edges(key,value):
    yield key[0],(key[1],value)
In the reducer we output adjacency lists, also adding a shadow or dummy node for each user, to which the observed label distributions are clamped.  We use the dumbo getpath decorator to output the adjacency lists and label distributions into different directories:

@opt("getpath", "yes")
def reduce_edges(key,values):
    values = list(values)
    yield ("adjacency_lists",key),values
    if key[0] == USER:
        yield ("adjacency_lists",(DUMMY_USER,key[1])),[(key,INJECTION_WEIGHT)]
        dist = normalise_dist(values,MAX_LABELS_PER_NODE)
        yield ("label_dists",(DUMMY_USER,key[1])),label_dist
Here's the function we use to normalise and prune label distributions, and the dumbo runner for this job:

def normalise_dist(dist,max_labels):
    dist = sorted(dist,key=itemgetter(1),reverse=True)[:max_labels]
    norm = sum(weight for label,weight in dist)
    return [(label,weight/norm) for label,weight in dist]

def runner(job):

Now we're ready to propagate the label distributions.  We do this iteratively, on each iteration sending the distribution at each node to each of its neighbours.  First we use some dumbo magic to join distributions to adjacency lists.  We just need to write a reducer:

class Reducer(JoinReducer):
    def primary(self,key,values):
        self.label_dist = values.next()
    def secondary(self,key,values):
        yield key,(self.label_dist,values.next())

Then we transmit the distribution from each node to its neighbours:

def map_propagate(key,value):
    label_dist,adjacency_list = value
    for node,weight in adjacency_list:
        yield node, [(label,prob*weight) for label,prob in label_dist]

Finally we sum and normalise the incoming distributions at each receiving node:

def reduce_sum_normalise(key,values):
    dist = defaultdict(lambda:0)
    for d in values:
        for label,prob in d:
            dist[label] += float(prob)
    dist = normalise_dist(dist.items(),MAX_LABELS_PER_NODE)
    yield key,dist

Here's the runner for this job:

def runner(job):
    multimapper = MultiMapper()
    multimapper.add("label_dists", primary(identitymapper))
    multimapper.add("adjacency_lists", secondary(identitymapper))

And that's it: a massively scalable graph-based recommender in under 50 lines of python.  Well ok, we still need to interpret the ouput.  We started with an observed distribution at the dummy node for Norman like this:

(3, 'norman')   [((2, 'orange'), 0.75), ((2, 'banana'), 0.25)]

and after a few iterations our recommender has inferred this at his node:

(1, 'norman')   [((2, 'orange'), 0.45735654387297453), ((2, 'banana'), 0.28646536729600702), ((2, 'cherry'), 0.11646859085648147), ((2, 'apple'), 0.074826725260416671), ((2, 'pear'), 0.064882772714120379)]

So next lunchtime I'll suggest he tries some cherries.

Thursday, 18 February 2010

Integrating social relationships into a music recommender

It's always nice to see people working with Last.fm's huge quantity of publicly available data, especially when they include several researchers from IBM's famous research labs. In their ACM Recommender Systems 2009 paper Augmenting collaborative recommender by fusing explicit social relationships, Q. Yuan, S. Zhao, L. Chen, Y. Liu, S. Ding, X. Zhang and W. Zheng report some experiments incorporating Last.fm's friend and group membership relationships into more conventional Collaborative Filtering systems.

Their first recommender is a user-based CF system i.e. the score given to a new recommendation of Muse for user Jim is a weighted average over all the scores (ratings, scrobbles, etc.) for users who already listen to Muse, and the weights are similarities computed between Jim and each of Muse's fans.

Yuan et al. start with a baseline CF recommender in which the similarity between users is based on cosine distance between their vectors of artist scrobbles. For reasons that aren't obvious to me, the scrobble counts are first mapped to binary values, e.g. 1 if a user has scrobbled Muse three or more times, 0 otherwise. In their first experimental recommender, these similarities are augmented by a weighted combination of similarities based on friend and group membership relationships. The relationship similarities are also computed as cosine distances between binary vectors, so the friendship similarity between two users expresses the extent to which they have friends in common, and likewise for group membership.

The second experimental system is a graph-based recommender, built on a graph of user, artist and group nodes. The graph starts with the obvious edges between users and the artists they listen to, and the groups they belong to. Then edges are added between pairs of users whose relationship similarities are more than some threshold, so each user is connected to other users with similar friends or belonging to similar groups as well as to their 'real' friends. Like the binary CF vectors in the first system, the edge weights are all set to 1. Finally user-user similarity scores are extracted from the graph using a Laplacian matrix pseudoinverse algorithm: in practice the computation required is similar to inverting a matrix containing concatenated artist, friend and group vectors for each user.

After these nice simple suggestions for combining social relationships and CF, the experiments themselves are a bit of a disappointment, mainly because the dataset they use contains only 1000 artists and users. The evaluation metrics are precision, recall and f-value over the top few recommendations, where a recommendation is considered correct if the user actually scrobbled the reommended artist at least three times. In the CF recommender, using weights that seem to be rather highly fitted to the small dataset, there is a 2-3% improvement in f-value over the top five recommendations. The graph recommender shows an 8% improvement, but they don't give baseline results from a graph not including the social relationships, so it's not clear whether this is due to the extra social information or just the different algorithm.

In fact using such a small dataset raises all sorts of problems in interpreting the results:
  • it's hard to tell if absolute values are good or bad
  • is it sensible to evaluate the top-5 recommendations from a pool of only 1000 artists?
  • are the improvements significant?
  • are the various parameters overfitted?
  • does the graph method scale?
It makes me suspect that IBM are just playing in the recommender space, while others are clearly serious about it.

Sunday, 24 January 2010

Getting names right: an unsolved problem

One of the best things about working at Last.fm is the opportunity to do applied research. Not so much an opportunity, in fact, as a necessity, since many of the problems we tackle in the Music Information Retrieval team have no established solution. One challenge which I'm working on at the moment is deciding which of the millions of artist names in our catalogue may in fact refer to the same people; and likewise for albums and tracks. Where we find more than one spelling or version of a particular name we then also have to decide if one of them should be chosen as the correct one, and the others automatically corrected to it.

Historically the first of these problems has been well studied under several different names in different fields of computer science, including duplicate detection, record linkage and entity or coreference resolution. Funnily enough, in the case of a music catalogue, detecting duplicates is at least partially solved by the beauty of audio fingerprinting, which tells us with high confidence if two tracks are the same or not. The second task, usually known as name canonicalization, is much more difficult for us, and unfortunately has grabbed much less academic research attention over the years.

One recent exception is a 2009 paper on coreference resolution by Michael Wick, Aron Culotta, Khashayar Rohanimanesh and Andrew McCallum. This describes a joint graphical model (a Conditional Random Field) that can learn coreferences and canonical names at the same time. This requires reasoning about both entities and the various mentions of those entities. To do this you need to compute aggregated features that express the similarity of entire clusters of mentions to entities or other clusters of mentions. Real-valued features can be aggregated in a conventional way, for example the string distance between two groups of names can be expressed as the average or maximum of the pairwise distances. Boolean features such as exact matches can be aggregated with logical operators - a nice idea that was new to me - for example we can say that all of set of mentions of a track match a given title, or that at least one of them does, a majority of them does, etc.

According to Wick et al., doing canonicalization jointly with entity resolution leads to a big improvement in the number of duplicates identified. Canonical entities are estimated as centroids of each group of related names, where distance is measured with a tunable edit distance. This is based on the assumption that spelling variations are essentially random typos: this is frequently not true for artist names, where variations such as Avengers and The Avengers, or An Cafe and アンティック-珈琲店- are surprisingly common. Last.fm's catalogue includes huge amounts of metadata supplied by users with their scrobbles, typically in the form of artist, album and track fields in the ID3 tags of mp3 files. As a result, even by the most conservative estimate, many millions of artist names in our catalogue need correction. Trusted catalogues such as MusicBrainz contain only around 500k aritsts, making both the problem of canonicalization, and the question of how to evaluate any proposed solution, an important practical issue for us. Despite their nice results on entity resolution, Wick et al. don't evalute the results of their canonicalization itself at all, saying simply that “we have no data on which to evaluate [its] performance”. This is a pity, and avoids one of the main research challenges here, because shortage of trustworthy data is only to be expected in any scenario where large-scale automatic canonicalization is necessary.

At Last.fm we have come up with several different measures to evalute our correction system. Firstly we compute the overall likelihood of our mappings to canonical names based on probabilistic estimates of the accuracy of trusted catalogues, and of the proportion of names in our catalogue which are likely to be incorrect. This turns out to be equivalent to assigning different weights to false positive and false negative corrections, and estimating their total weight over all our corrections. Unfortunately the required probabilities or weights are very hard to estimate accurately, so this measure remains rather speculative. Secondly we use a small groundtruth marked up by hand, which works well but clearly doesn't scale. Finally we look at the votes we receive for different forms of each artist name, through the suggest a correction feature on artist pages: this gives us an approximate measure of precision, based on the number of artists where users vote strongly against our corrections to canonical names, and recall, based on the number of artists where users vote strongly to say that we need to make a correction where we aren't making one. The performance estimates produced by these different measures vary, but the precision of our system looks fairly good, at over 97% even according to the worst estimate. Our estimates of recall vary a lot, but it may be as low as 40%, leaving plenty of headroom for making more corrections.

So for any new researchers in IR looking for wide open research topic with practical applications, name canonicalization is well worth a look!