Tuesday, June 15, 2010

High-Level Hadoop MapReduce: Rookie/Novice/Fresher...

MapReduce borrows a lot from functional programming (Lisp/ML/Scheme). Functional programs process lists of data constantly, so these languages come with built-in iteration mechanisms and higher-order functions that operate over lists. Two of these operators are map and reduce. Like the map operator, a Mapper takes a record (modeled as a key-value pair), but where map emits exactly one result per input, a Mapper can emit zero, one, or many key-value pairs. A defining characteristic of the MapReduce paradigm is that the mapper processes individual records in isolation from one another. (What counts as "one record" is up to you: it depends on the InputFormat class that loads the data from the block into the mapper as key-value records.)
A Reducer takes a key and the list of values for that key, and likewise can emit zero, one, or many key-value pairs.
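
A minimal sketch of both, using the newer org.apache.hadoop.mapreduce API (the class names and the Text/Text output types here are made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Each map() call sees exactly one record, in isolation from the rest.
    class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // With TextInputFormat the key is the byte offset of the line
            // and is usually ignored. Emit zero, one, or many pairs:
            context.write(new Text("someKey"), value);
        }
    }

    // Each reduce() call gets one key plus all the values for that key.
    class MyReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v); // again: zero, one, or many pairs
            }
        }
    }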

It's common practice to ignore the key that's fed into a map task; with the default text input it's just the byte offset of the record within the file.

You can also skip the mapper, or more often the reducer, if that suffices for the application (see the driver sketch below).
Reduce calls don't begin until all mappers have completed (the framework may start copying map output earlier, but no reducing happens before the last map finishes). Reduce tasks run on the same worker nodes that ran the map tasks.
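
A rough driver sketch for a reducer-less (map-only) job, reusing the hypothetical MyMapper above; input and output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    class MapOnlyDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "map-only sketch");
            job.setJarByClass(MapOnlyDriver.class);
            job.setMapperClass(MyMapper.class);
            job.setNumReduceTasks(0); // no reducers: sort/shuffle is skipped and
                                      // map output is written straight to HDFS
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }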
All values with the same key are collected from every mapper that emitted that key and sent to the same reducer, which involves network communication. If multiple keys are handled by one reduce task, they are presented to it in sorted key order (no ordering is guaranteed for the values under each key). This is the "sort and shuffle" phase, handled by the underlying framework. Each call to the reduce function processes a single key: it takes one key and the list of values for that key as input.
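
Which reduce task a given key lands on is decided by the job's partitioner. The sketch below mirrors what Hadoop's default HashPartitioner does; it's why every pair sharing a key converges on one reducer:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Same logic as the default HashPartitioner: keys with equal hash
    // codes are routed to the same reduce task.
    class SketchPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }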

Within sort and shuffle there can be a user-defined combine step that runs on the mapper's intermediate results. If the application logic allows, it can be the same code as the reducer (e.g., when the reduce operation is commutative and associative). The combine phase is disabled by default. It runs on the mapper's machine, and it exists solely to reduce the data sent over the network and the load on the reducers. Don't put anything the job's correctness depends on into the combine phase: it may run zero, one, or more times, depending on the size of the data. The classic example is the word-count MapReduce job. The mapper emits [word, 1]; the combiner aggregates that mapper's results into [word, n] (the number of times the word occurred in the input blocks local to that machine). Then the reducers take over (the map tasks exit).
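A hedged sketch of that word-count flow, reusing the sum reducer as the combiner (safe here because integer addition is commutative and associative); the class names are made up for illustration:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Emits [word, 1] for every token in the input line.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(line.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts for each word. Because addition is commutative and
    // associative, the same class can serve as the combiner, collapsing a
    // mapper's local [word, 1] pairs into [word, n] before the shuffle.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

In the driver, enabling the combiner is one call: job.setCombinerClass(WordCountReducer.class). The framework then applies it opportunistically to the map-side output.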

Thank You.
