Big Data Hadoop & Spark - Advanced

Writing MapReduce in Python using Hadoop Streaming

In this Blog we will be discussing execution of MapReduce application in Python using Hadoop Streaming.

We will be learning about streaming feature of hadoop which allow developers to write Mapreduce applications in other languages like Python and C++.

We will be starting our discussion with hadoop streaming which has enabled users to write MapReduce applications in a pythonic way.

We have used hadoop-2.6.0 for execution of the MapReduce Job.

Hadoop streaming can run MapReduce jobs in practically any language .

It Uses Unix Streams as communication mechanism between Hadoop and your code from any language that can read standard input and write standard output.

Steps implemented in Hadoop Streaming:

  • Mappers and Reducers are used as executables which read the input from stdin and emit the output to stdout.

  • First, the Mapper task converts the input into lines and places it into the stdin of the process.

  • After this the Mapper collects the output of the process from stdout and converts it into key/value pair.

  • These key/value pairs are the actual output from the Mapper task.

  • First it converts the key/value pairs into lines and put it into the stdin of the process.

  • Then the reducer collects the line output from the stdout of the process and prepare key/value pairs as the final output of the reduce job.

Steps for Execution:

Step 1: We need to create one input file and store that in HDFS,refer the below screenshot for the same

StepĀ  2: We need to first write mapper class as shown below and save it in our local directory,in our case we have saved it in /home/acadgild directory.

Creating the mapper file and writing script:


#!/usr/bin/env python
import sys
for line in sys.stdin:
line = line.strip()
words = line.split()
for word in words:
print '%s\t%s' % (word, 1)

Step 3: We need to write reducer class in Python as shown below and save it in our local directory,in our case we have saved the file in /home/acadgild directory.

Creating the Reducer file and writing script in it:

#!/usr/bin/env python
from operator import itemgetter
import sys
current_word = None
current_count = 0
word = None
for line in sys.stdin:
line = line.strip()
word, count = line.split('\t', 1)
count = int(count)
except ValueError:
if current_word == word:
current_count += count
if current_word:
print '%s\t%s' % (current_word, current_count)
current_count = count
current_word = word
if current_word == word:
print '%s\t%s' % (current_word, current_count)

Step 4: Execution of MapReduce using Hadoop Streaming Jar.

hadoop jar hadoop-streaming-2.6.0.jar -file /home/acadgild/ -mapper /home/acadgild/ -file /home/acadgild/ -reducer /home/acadgild/ -input /my_input -output /my_output3

Step 5: View the output from HDFS using the below command.

We hope this blog helped you in understanding hadoop streaming and executing MapReduce program in Python.

Keep visiting our website for more blogs on big data technologies.


Leave a Reply

Your email address will not be published. Required fields are marked *

Related Articles