Tuesday, 6 April 2010

MapReduce algorithm design

Data-Intensive Text Processing with MapReduce is a new book by Jimmy Lin and Chris Dyer of the University of Maryland.  It's due for publication later this year but a full draft is already available as a pdf.  The book shows how you can implement a variety of useful algorithms on a MapReduce cluster, including graph algorithms such as breadth first search and PageRank, and parameter estimation for latent variable models, with a detailed explanation of how to do this for Hidden Markov Models, and even a sketch for Conditional Random Fields.  Although the book is a practical manual, the algorithms are given in simple pseudo-code rather than Java classes intended for use on a Hadoop cluster.  This has huge advantages for readability, and makes it much easier for the authors to draw out some generally applicable design patterns for MapReduce algorithms.

The order inversion pattern is a nice trick that lets a reducer see intermediate results before it processes the data that generated them.  Lin and Dyer illustrate this with the example of computing relative frequencies for co-occurring word pairs e.g. what are the relative frequencies of words occurring within a small window of the word "dog"?  The mapper counts word pairs in the corpus, so its output looks like
((dog, cat), 125)
((dog, foot), 246)
But it also keeps a running total of all the word pairs containing "dog", outputting this as
((dog,*), 5348)
Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*), 5348) first, followed by all the other counts, and can trivially store the total and then output relative frequencies.  The benefit of the pattern is that it avoids an extra MapReduce iteration without creating any additional scalability bottleneck.

Other patterns explained in the book include the pairs and stripes approaches to produce large sparse matrix mapper output, in-mapper combining to limit the amount of mapper output written to disk (a common scalability bottleneck in MapReduce), and value-to-key conversion for relational joins of large datasets.

All in all, this book is a great complement to Tom White's Hadoop book.  An extra plus is that the pseudo code can be translated with virtually no effort into dumbo code for Hadoop.  I can see Data-Intensive Text Processing and dumbo becoming a standard way of teaching MapReduce quickly and excitingly in the classroom.


  1. Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
    rpa training in bangalore
    best rpa training in bangalore
    RPA training in bangalore
    rpa course in bangalore
    rpa training in chennai
    rpa online training

  2. Hmm, it seems like your site ate my first comment (it was extremely long) so I guess I’ll just sum it up what I had written and say, I’m thoroughly enjoying your blog. I as well as an aspiring blog writer, but I’m still new to the whole thing. Do you have any recommendations for newbie blog writers? I’d appreciate it.
    AWS Course Interview Questions and Answers for Freshers | AWS Interviews Questions and Answers for Devops
    AWS Tutorial – Besant Technologies
    AWS Interview questions and answers for Sysops |AWS Interview Question and Answers BlogSpot
    AWS Tutorial For Beginners | AWS Tutorial with AWS Training Videos

  3. I read this post two times, I like it so much, please try to keep posting & Let me introduce other material that may be good for our community.

    online Python training
    python training in chennai

  4. Really very nice blog information for this one and more technical skills are improve,i like that kind of post.
    AWS Training in Bangalore
    AWS training in sholinganallur
    AWS training in Tambaram
    AWS training in Velachery

  5. This comment has been removed by the author.

  6. Thanks for sharing such helpful information with all of us I appreciate your effort of writing a value able piece of content.
    [url=http://procinehub.com/]best baby photographer in jalandhar[/url]
    [url=http://procinehub.com/]best fashion photographer in Chandigarh[/url]
    [url=https://www.styleandgeek.com/home-remedies-hair-fall//]home remedies for hair fall[/url]
    [url=https://www.styleandgeek.com/top-25-home-remedies-to-remove-tanning//home-remedies-hair-fall//]home remedies to get rid of tanning[/url]
    [url=https://www.lms.coim.in//]Online Digital Marketing Training[/url]

  7. Great to share this information thanks. I am really happy to say it’s an interesting post to read. I learn new information from your blog.