I am reading Eric Sammer’s Hadoop Operations. In the book, regarding MapReduce performance/tuning, it says of mapred.reduce.parallel.copies: “Each reducer task must fetch intermediate map output data from each of the task trackers where a map task from the same job ran. In other words there are Reducers x Mappers number of total copies that must be performed.”
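If I'm reading that right, this is the knob in question. I currently leave it at the default, but I understand it could be raised from the driver configuration like the sketch below (the property name is the MRv1-era one the book uses; the value 10 is just an example, not a recommendation):

```java
import org.apache.hadoop.conf.Configuration;

public class ParallelCopiesExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // mapred.reduce.parallel.copies: how many threads each reducer uses to
        // fetch map output in parallel during the shuffle (default is 5).
        // 10 below is only an example value.
        conf.setInt("mapred.reduce.parallel.copies", 10);
        System.out.println("parallel copies = "
                + conf.getInt("mapred.reduce.parallel.copies", 5));
    }
}
```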
Questions:
1. Is it true that a reducer task only works on a single key? So if a job deals with data that has 10 keys, would that mean 10 reducer tasks for the job? (For context, a driver sketch of the kind of job I mean follows the questions.)
2. So a reducer working on a particular key on a particular node will get data for that key both from the intermediate data stored locally and from the task trackers/intermediate data on the other nodes running the same job? By the way, isn't data sorted in HDFS, so that the data for a key is mostly local to a particular node rather than spread across nodes?
3. So all the reducers for the job, running on different nodes, complete their aggregation over their own sets of data, but who finally collects/compiles the data from the various reducers and produces the final output?
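For context, here is a minimal word-count-style driver standing in for my actual job (a sketch using the org.apache.hadoop.mapreduce API; the class names are just my placeholders). Note that I have not called setNumReduceTasks, which is partly why I'm asking question 1:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for one key per reduce() call.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // I have NOT set the number of reducers explicitly, e.g.:
        // job.setNumReduceTasks(10);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```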
Appreciate the insights.