Reducer takes a key and list of values and can emit none, one or multiple key-value pairs.
It's a general notion to ignore the key that's input to a map task. They could be byte offset to a chunk of data.
People also skip mapper or more often reducer, if it suffices for the app.
Reducers don't start until all mappers complete. They run on the same nodes as mappers(after mappers complete).
All values with same keys are collected from all mappers(that emitted that key) and sent to same reducer. This involves network communication. If multiples keys are processed by one reducer, they are presented to it in sorted order(No assumptions about the values of those keys). This is called "sort and shuffle" phase handled by the underlying framework. A single reduce task processes single key(it takes one key and list of values as input).
Within sort and shuffle, there can be a user-defined combine task that's run on the mapper's intermediate results. If app-logic allows it could be same as reducer code(eg: if reducer is commutative and associative). Combining phase is disabled by default. It runs on the mapper machine. It's solely for reducing network data-load and load on reducer. Don't perform any data-specific operation in combine phase. It may run zero, one or more times.(depends on size of data). Example is the word-count map reduce jobs. Mapper emits [
Thank You.
No comments:
Post a Comment