Historically the first of these problems has been well studied under several different names in different fields of computer science, including duplicate detection, record linkage, and entity or coreference resolution. Funnily enough, in the case of a music catalogue, detecting duplicates is at least partially solved by the beauty of audio fingerprinting, which tells us with high confidence whether or not two tracks are the same. The second task, usually known as name canonicalization, is much more difficult for us, and unfortunately has attracted much less academic attention over the years.
One recent exception is a 2009 paper on coreference resolution by Michael Wick, Aron Culotta, Khashayar Rohanimanesh and Andrew McCallum. It describes a joint graphical model (a Conditional Random Field) that learns coreferences and canonical names at the same time. This requires reasoning about both entities and the various mentions of those entities, which in turn means computing aggregated features that express the similarity of entire clusters of mentions to entities or to other clusters of mentions. Real-valued features can be aggregated in the conventional way: for example, the string distance between two groups of names can be expressed as the average or maximum of the pairwise distances. Boolean features such as exact matches can be aggregated with logical quantifiers, a nice idea that was new to me: for example we can say that all of a set of mentions of a track match a given title, that at least one of them does, that a majority of them do, and so on.
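To make the aggregation idea concrete, here's a minimal sketch in Python. This is my own illustration, not the paper's implementation; `dist` stands in for any pairwise string distance (Levenshtein, say), and all the function names are mine.

```python
from statistics import mean

def aggregate_real(cluster_a, cluster_b, dist):
    """Aggregate a real-valued pairwise feature (a string distance)
    over two clusters of mentions, as both an average and a maximum."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    return {"avg": mean(pairwise), "max": max(pairwise)}

def aggregate_boolean(mentions, title):
    """Aggregate an exact-match predicate over a cluster of mentions
    with logical quantifiers: all match / at least one / a majority."""
    matches = [m == title for m in mentions]
    return {
        "all": all(matches),
        "any": any(matches),
        "majority": sum(matches) > len(matches) / 2,
    }
```

In the paper, quantified features like these become factors in the CRF, letting the model score whole clusters rather than just pairs of mentions.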
According to Wick et al., doing canonicalization jointly with entity resolution leads to a big improvement in the number of duplicates identified. Canonical entities are estimated as centroids of each group of related names, where distance is measured with a tunable edit distance. This rests on the assumption that spelling variations are essentially random typos, which is frequently untrue for artist names, where variations such as Avengers and The Avengers, or An Cafe and アンティック-珈琲店-, are surprisingly common. Last.fm's catalogue includes huge amounts of metadata supplied by users with their scrobbles, typically in the form of the artist, album and track fields of the ID3 tags of mp3 files. As a result, even by the most conservative estimate, many millions of artist names in our catalogue need correction. Trusted catalogues such as MusicBrainz contain only around 500k artists, making both the problem of canonicalization and the question of how to evaluate any proposed solution important practical issues for us. Despite their nice results on entity resolution, Wick et al. don't evaluate the results of the canonicalization itself at all, saying simply that “we have no data on which to evaluate [its] performance”. This is a pity, and sidesteps one of the main research challenges here, because a shortage of trustworthy data is only to be expected in any scenario where large-scale automatic canonicalization is necessary.
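Here's a rough sketch of the centroid idea, assuming a plain (untuned) Levenshtein distance in place of the learned edit distance from the paper. Strictly speaking it computes a medoid, since the canonical form is restricted to names actually observed in the cluster.

```python
def levenshtein(a, b):
    """Plain edit distance; a stand-in for the tunable edit
    distance that Wick et al. learn from data."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def canonical_name(names):
    """Choose the name minimising total edit distance to the rest of
    the cluster (a medoid, as we only consider observed names)."""
    return min(names, key=lambda n: sum(levenshtein(n, m) for m in names))

print(canonical_name(["The Avengers", "Avengers", "The Avegners"]))
# -> "The Avengers"
```

This works nicely when variations really are random typos, but as noted above it has no way to prefer The Avengers over Avengers, or An Cafe over アンティック-珈琲店-, since those differences aren't typos at all.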
At Last.fm we have come up with several different measures to evaluate our correction system. Firstly, we compute the overall likelihood of our mappings to canonical names, based on probabilistic estimates of the accuracy of trusted catalogues and of the proportion of names in our catalogue that are likely to be incorrect. This turns out to be equivalent to assigning different weights to false positive and false negative corrections, and estimating their total weight over all our corrections. Unfortunately the required probabilities or weights are very hard to estimate accurately, so this measure remains rather speculative. Secondly, we use a small ground truth marked up by hand, which works well but clearly doesn't scale. Finally, we look at the votes we receive for different forms of each artist name through the suggest a correction feature on artist pages. This gives us an approximate measure of precision, based on the number of artists where users vote strongly against our corrections to canonical names, and of recall, based on the number of artists where users vote strongly for a correction where we aren't making one. The performance estimates produced by these different measures vary, but the precision of our system looks fairly good, at over 97% even according to the worst estimate. Our estimates of recall vary a lot, but it may be as low as 40%, leaving plenty of headroom for making more corrections.
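As a rough illustration of how the vote-based estimates might be computed (the data structures and the 80% threshold here are hypothetical, not our production code, and artists with no strong rejection are simply counted as correct):

```python
def strong_majority(side, other, threshold=0.8):
    """True when `side` holds at least `threshold` of the votes cast."""
    total = side + other
    return total > 0 and side / total >= threshold

def estimate_precision(corrected_votes):
    """corrected_votes: (supporting, rejecting) vote counts, one pair per
    artist we corrected. Precision is approximated as the share of
    corrections that users do not strongly reject."""
    rejected = sum(strong_majority(rej, sup) for sup, rej in corrected_votes)
    return 1 - rejected / len(corrected_votes)

def estimate_recall(corrected_votes, uncorrected_votes):
    """uncorrected_votes: (happy, demanding) vote counts per artist we
    left alone. Recall is approximated as corrections made over
    corrections needed (made plus strongly demanded)."""
    missed = sum(strong_majority(dem, ok) for ok, dem in uncorrected_votes)
    return len(corrected_votes) / (len(corrected_votes) + missed)
```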
So for any new researchers in IR looking for a wide-open research topic with practical applications, name canonicalization is well worth a look!