The order inversion pattern is a nice trick that lets a reducer see intermediate results before it processes the data that generated them. Lin and Dyer illustrate this with the example of computing relative frequencies for co-occurring word pairs e.g. what are the relative frequencies of words occurring within a small window of the word "dog"? The mapper counts word pairs in the corpus, so its output looks like
((dog, cat), 125)
((dog, foot), 246)
...
But it also keeps a running total of all the word pairs containing "dog", outputting this as
((dog,*), 5348)
Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*), 5348) first, followed by all the other counts, and can trivially store the total and then output relative frequencies. The benefit of the pattern is that it avoids an extra MapReduce iteration without creating any additional scalability bottleneck.
Other patterns explained in the book include the pairs and stripes approaches to produce large sparse matrix mapper output, in-mapper combining to limit the amount of mapper output written to disk (a common scalability bottleneck in MapReduce), and value-to-key conversion for relational joins of large datasets.
All in all, this book is a great complement to Tom White's Hadoop book. An extra plus is that the pseudo code can be translated with virtually no effort into dumbo code for Hadoop. I can see Data-Intensive Text Processing and dumbo becoming a standard way of teaching MapReduce quickly and excitingly in the classroom.
0 comments:
Post a Comment