In this Blog we will be discussing execution of MapReduce application in Python using Hadoop Streaming.
We will be learning about streaming feature of hadoop which allow developers to write Mapreduce applications in other languages like Python and C++.
We will be starting our discussion with hadoop streaming which has enabled users to write MapReduce applications in a pythonic way.
We have used hadoop-2.6.0 for execution of the MapReduce Job.
Hadoop streaming can run MapReduce jobs in practically any language .
It Uses Unix Streams as communication mechanism between Hadoop and your code from any language that can read standard input and write standard output.
Steps implemented in Hadoop Streaming:
Mappers and Reducers are used as executables which read the input from stdin and emit the output to stdout.
First, the Mapper task converts the input into lines and places it into the stdin of the process.
After this the Mapper collects the output of the process from stdout and converts it into key/value pair.
These key/value pairs are the actual output from the Mapper task.
First it converts the key/value pairs into lines and put it into the stdin of the process.
Then the reducer collects the line output from the stdout of the process and prepare key/value pairs as the final output of the reduce job.
Steps for Execution:
Step 1: We need to create one input file and store that in HDFS,refer the below screenshot for the same
Step 2: We need to first write mapper class as shown below and save it in our local directory,in our case we have saved it in /home/acadgild directory.
Creating the mapper file and writing script:
#!/usr/bin/env python import sys for line in sys.stdin: line = line.strip() words = line.split() for word in words: print '%s\t%s' % (word, 1)
Step 3: We need to write reducer class in Python as shown below and save it in our local directory,in our case we have saved the file in /home/acadgild directory.
Creating the Reducer file and writing script in it:
#!/usr/bin/env python from operator import itemgetter import sys current_word = None current_count = 0 word = None for line in sys.stdin: line = line.strip() word, count = line.split('\t', 1) try: count = int(count) except ValueError: continue if current_word == word: current_count += count else: if current_word: print '%s\t%s' % (current_word, current_count) current_count = count current_word = word if current_word == word: print '%s\t%s' % (current_word, current_count)
Step 4: Execution of MapReduce using Hadoop Streaming Jar.
hadoop jar hadoop-streaming-2.6.0.jar -file /home/acadgild/my_first_mapper.py -mapper /home/acadgild/my_first_mapper.py -file /home/acadgild/my_first_reducer.py -reducer /home/acadgild/my_first_reducer.py -input /my_input -output /my_output3
Step 5: View the output from HDFS using the below command.
We hope this blog helped you in understanding hadoop streaming and executing MapReduce program in Python.
Keep visiting our website www.acadgild.com for more blogs on big data technologies.