MapReduce in Simple Words:
Input splits -> Map phase -> Reduce phase
In this blog, we will discuss how MapReduce operations are performed internally, using a real-life example.
Generally, when a user runs a Hadoop job, whether it is written in Java or generated by tools such as Pig or Hive, the job is executed in three phases: the Map phase, the Sort and Shuffle phase, and the Reduce phase.
The Map phase is the first phase of a Hadoop MapReduce job. In the Map phase, one map task is created for each input split, and that task executes the map function for each record in the split. Each map task emits its results as key-value pairs.
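To make the Map phase concrete, here is a minimal Python sketch. It is not Hadoop code: the names make_splits, map_fn and run_map_task are hypothetical, and the word-counting map function is only an assumed example of what a user-defined map function might do.

```python
from typing import Iterator


def make_splits(records: list[str], split_size: int) -> list[list[str]]:
    """Cut the input into fixed-size chunks (a stand-in for input splits)."""
    return [records[i:i + split_size] for i in range(0, len(records), split_size)]


def map_fn(record: str) -> Iterator[tuple[str, int]]:
    """Hypothetical map function: emit (word, 1) for every word in one record."""
    for word in record.split():
        yield (word.lower(), 1)


def run_map_task(split: list[str]) -> list[tuple[str, int]]:
    """One map task: apply map_fn to each record of its own split."""
    output = []
    for record in split:
        output.extend(map_fn(record))
    return output


records = ["red green", "blue red", "green green"]
splits = make_splits(records, split_size=2)

# One map task is created per input split.
map_outputs = [run_map_task(split) for split in splits]
print(map_outputs)
```

In this sketch the three records are cut into two splits, so two map tasks run, and each one produces its own list of key-value pairs.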
The MapReduce framework automatically sorts the keys generated by the mappers. That is, all the intermediate key-value pairs produced in the Map phase are sorted by key before they are sent to the Reduce phase.
In the Sort and Shuffle phase, the values associated with each unique key are grouped together, and each key with its list of values is then sent to the Reduce phase for aggregation. The process of moving the sorted map output to the reducers is known as shuffling.
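The sorting and grouping described above can be sketched in a few lines of Python. This is only an in-memory illustration, assuming the map output fits in a list; the function name shuffle_and_sort and the sample pairs are made up for this example.

```python
from itertools import groupby
from operator import itemgetter

# Assumed output of two map tasks, as (key, value) pairs.
map_outputs = [
    [("red", 1), ("green", 1), ("blue", 1), ("red", 1)],
    [("green", 1), ("green", 1)],
]


def shuffle_and_sort(outputs):
    """Merge all map outputs, sort them by key, and group the values per unique key."""
    merged = [pair for task_output in outputs for pair in task_output]
    merged.sort(key=itemgetter(0))  # groupby needs its input sorted by key
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


grouped = shuffle_and_sort(map_outputs)
print(grouped)
# [('blue', [1]), ('green', [1, 1, 1]), ('red', [1, 1])]
```

Each unique key now appears exactly once, together with every value that any map task emitted for it.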
Finally, the sorted and shuffled map output is transferred to the reduce tasks, the key-value pairs are aggregated according to the user-defined reduce code, and the resulting output is stored in HDFS.
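As a rough sketch of the Reduce phase, the Python below applies a reduce function to each key and its grouped values. The reduce function here simply sums the values, which is an assumption; in a real job it is whatever the user defines. The sample input and the tab-separated output lines are also only illustrative, and nothing is written to HDFS.

```python
def reduce_fn(key, values):
    """Hypothetical user-defined reduce function: sum all values for one key."""
    return (key, sum(values))


# Assumed input: what the reducers receive after sort and shuffle.
grouped = [("blue", [1]), ("green", [1, 1, 1]), ("red", [1, 1])]

# The reduce function is called once per unique key.
results = [reduce_fn(key, values) for key, values in grouped]

# Format each result as one "key<TAB>value" line for the final output.
output_lines = [f"{key}\t{total}" for key, total in results]
print("\n".join(output_lines))
```

The important point is that the reducer never sees individual map outputs, only one key at a time with all of its values.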
Comparison of a Real-Life Example with MapReduce:
In India, after an election, all the EVMs (electronic voting machines) are brought to one place for counting, and polling officers then count the votes stored in each EVM.
This means the actual work is done by the polling officers, but that work is performed on the EVMs.
Let’s now link the components of the election with the actual components of MapReduce.
Input Splits: Here, an input split is the EVM that corresponds to one polling booth, and the votes stored in one EVM are counted by one polling officer.
Map Phase: In the Map phase, each polling officer counts the ballots for each candidate in his or her own polling booth. This is done simultaneously for all polling booths. Here, each candidate becomes a key, and the number of votes for that candidate becomes the value.
Reduce Phase: In the Reduce phase, the ballot counts from every booth under a parliamentary seat are taken and results are generated for each candidate.
In other words, the individual results of all the polling booths are collected and grouped by key. Finally, the total number of votes for each candidate is calculated.
The process is represented pictorially in the image below.
Here, each booth has several EVMs in which people cast their votes. In this example, there are three candidates: R, G, and B.
In the Map phase, keys and values are prepared for each booth. The key is the candidate, and the value is the number of votes for that candidate in that particular booth.
After the Map phase, sorting and shuffling is done, where the counts for each candidate (key) from all the mappers are shuffled and gathered in one place.
In the Reduce phase, the keys arrive sorted, the values for each key are added up, and finally the total votes for each candidate are calculated.
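The whole election example can be simulated in plain Python. The sketch below is not Hadoop code, and the booth names and vote lists are made-up data. It only mirrors the three steps described above for candidates R, G and B.

```python
from collections import Counter, defaultdict

# Made-up data: the votes recorded at each polling booth (one input split each).
booths = {
    "booth_1": ["R", "G", "R", "B"],
    "booth_2": ["G", "G", "B"],
    "booth_3": ["R", "B", "B", "G"],
}


def map_booth(votes):
    """Map: one polling officer counts the votes per candidate in one booth."""
    return list(Counter(votes).items())


def shuffle(map_outputs):
    """Sort and shuffle: gather the per-booth counts of each candidate in one place."""
    grouped = defaultdict(list)
    for booth_output in map_outputs:
        for candidate, count in booth_output:
            grouped[candidate].append(count)
    return dict(sorted(grouped.items()))  # keys sorted before reducing


def reduce_candidate(candidate, counts):
    """Reduce: add up the booth-level counts for one candidate."""
    return candidate, sum(counts)


map_outputs = [map_booth(votes) for votes in booths.values()]
grouped = shuffle(map_outputs)
totals = dict(reduce_candidate(c, counts) for c, counts in grouped.items())

print(grouped)  # {'B': [1, 1, 2], 'G': [1, 2, 1], 'R': [2, 1]}
print(totals)   # {'B': 4, 'G': 4, 'R': 3}
```

Here each booth is counted independently, just as the polling officers work in parallel, and only the small per-booth counts are moved to the place where the totals are calculated.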
For more real-world analyses using MapReduce code, you can go through the blogs linked below.
Youtube Data Analysis
Titanic Data Analysis
Uber Data Analysis
We hope this post helped you understand how MapReduce works and how it is implemented. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.