Thursday, 11 July 2013

New Blog: Data Science in Action

A couple of months ago I left to join the data science team at Mendeley, meaning that for pretty much the first time in my life I'm not officially working on anything to do with music.  So it seemed time to get a new name for my blog.  Although I'm not personally convinced that "data scientist" actually means anything in particular, it's in my new job title, and of course the Harvard Business Review recently called it "the sexiest job of the 21st century", so I've gone with the flow.

If you've found anything of interest here then please follow me over to

Thursday, 18 April 2013

Contributing to GraphChi

GraphChi is a framework for processing large graphs efficiently on a single machine, developed by Aapo Kyrölä of CMU as a spin-off from the impressive GraphLab distributed graph processing project.  Both GraphLab and GraphChi come with a really handy Collaborative Filtering Toolbox, developed at CMU by Danny Bickson, which implements numerous recent algorithms.

GraphChi looks like a great project, so I decided to try to contribute to it and took the chance to implement an algorithm that I'd been wanting to investigate for a while: Collaborative Less-is-More Filtering, developed by Yue Shi at TU Delft, which won a best paper award at RecSys last year.  CLiMF optimises for the Mean Reciprocal Rank of correctly predicted items, i.e. it's designed to promote accuracy and diversity in recommendations at the same time.  Although it's really intended for binary preference data like follow or friend relations, it's easy to implement a threshold on ratings that automatically binarises them during learning, so CLiMF can also be used with ratings datasets.
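To make those two ideas concrete, here is a minimal Python sketch of thresholding ratings into binary relevance and of computing Mean Reciprocal Rank over ranked recommendation lists. It is only an illustration of the metric, not the GraphChi code or the CLiMF learning algorithm itself: the function names, the toy data, and the choice of treating a rating at or above the threshold as relevant are all assumptions made for the example.

```python
def binarise(ratings, threshold):
    """Turn {user: {item: rating}} into {user: set of relevant items}.

    Assumption for this sketch: a rating >= threshold counts as relevant.
    """
    return {
        user: {item for item, rating in items.items() if rating >= threshold}
        for user, items in ratings.items()
    }


def reciprocal_rank(ranked_items, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevant_by_user):
    """Average the reciprocal rank over users with at least one relevant item."""
    users = [u for u in rankings if relevant_by_user.get(u)]
    if not users:
        return 0.0
    total = sum(reciprocal_rank(rankings[u], relevant_by_user[u]) for u in users)
    return total / len(users)


# Hypothetical toy data: explicit ratings and a recommender's ranked output.
ratings = {
    "alice": {"a": 5, "b": 2, "c": 4},
    "bob": {"a": 1, "b": 4},
}
rankings = {
    "alice": ["b", "c", "a"],  # first relevant item is at rank 2
    "bob": ["b", "a"],         # first relevant item is at rank 1
}

relevant = binarise(ratings, threshold=4)
mrr = mean_reciprocal_rank(rankings, relevant)
print(mrr)
```

With a threshold of 4, alice's relevant items are "a" and "c" and bob's is "b", so the reciprocal ranks are 1/2 and 1, giving an MRR of 0.75. Only the position of the first relevant item matters for each user, which is why the metric rewards getting a few good items to the very top of the list rather than ranking everything correctly.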

Danny made contributing to the toolbox really easy and CLiMF is now available in GraphChi, and documented alongside the other algorithms.  I also wrote a simple Python implementation which works fine for small datasets and which was useful for reference.

You can get the latest version of GraphChi and the collaborative filtering toolbox from here.

Thursday, 21 March 2013

Hadoop and beyond: power tools for data mining

Last week Dell Zhang kindly invited me to give a guest lecture to Birkbeck and UCL students on his Cloud Computing course.  It was fun to show them some of the tools that make Hadoop development easier and more maintainable, and also some of the problems for which Hadoop is not a magic bullet.

I took the chance to evangelise about some of my favourite tools and frameworks, and to spend a bit of time getting to know some new ones.  I especially enjoyed taking a look at Scalding, Spark and GraphChi.

Here are the slides: