This topic describes how to submit a Hadoop Streaming job that uses Python scripts as the mapper and the reducer.
Prerequisites
An E-MapReduce (EMR) Hadoop cluster is created.
For more information about how to create a cluster, see Create a cluster.
Procedure
- Log on to the Hadoop cluster in SSH mode. For more information, see Connect to the master node of an EMR cluster in SSH mode.
- Create a file named mapper.py.
- Run the following command to create a file named mapper.py and open the file:
vim /home/hadoop/mapper.py
- Press the I key to switch to the edit mode.
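- Add the following code to the mapper.py file:
#!/usr/bin/env python
import sys

# Read lines from standard input and emit one "word<TAB>1" record per word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # A parenthesized print works in both Python 2 and Python 3.
        print('%s\t%s' % (word, 1))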
- Press Esc to exit the edit mode. Then, enter :wq to save and close the file.
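The mapper logic can be checked locally before the job is submitted. The following is a minimal sketch and not part of the original procedure: map_lines is a hypothetical helper that applies the same split-and-emit logic as mapper.py to an in-memory list of lines instead of sys.stdin, and the sample lines are illustrative data only.

```python
def map_lines(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)


if __name__ == '__main__':
    # Illustrative input that resembles entries in /etc/hosts.
    sample = ['127.0.0.1 localhost', '::1 localhost']
    for record in map_lines(sample):
        print(record)
```

Each word produces its own record, so repeated words appear several times in the mapper output. Summing them is left to the reducer.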
- Create a file named reducer.py.
- Run the following command to create a file named reducer.py and open the file:
vim /home/hadoop/reducer.py
- Press the I key to switch to the edit mode.
- Add the following code to the reducer.py file:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# The input arrives sorted by word, so identical words are adjacent.
for line in sys.stdin:
    line = line.strip()
    try:
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip lines that have no tab or whose count is not an integer.
        continue
    if current_word == word:
        current_count += count
    else:
        # A new word starts: emit the total of the previous word.
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total of the last word, if any input was processed.
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
- Press Esc to exit the edit mode. Then, enter :wq to save and close the file.
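Hadoop Streaming sorts the mapper output by key before it passes the records to the reducer, which is why reducer.py only needs to compare each word with the previous one. The following minimal sketch imitates that map, sort, and reduce flow in a single process. The function names (map_lines, reduce_sorted, run_local) are hypothetical, the in-memory sorted call only stands in for the shuffle phase of the framework, and groupby plays the role of the current_word comparison in reducer.py.

```python
from itertools import groupby


def map_lines(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)


def reduce_sorted(pairs):
    """Sum the counts of consecutive pairs that share a key.

    The input must already be sorted by key, as it is after the shuffle.
    """
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))


def run_local(lines):
    """Run map, an in-memory sort (stand-in for the shuffle), and reduce."""
    return list(reduce_sorted(sorted(map_lines(lines))))
```

If the sort step is omitted, identical words are no longer adjacent and the same word can be emitted more than once with partial counts. The same applies when reducer.py is run by hand on unsorted mapper output.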
- Run the following command to upload the hosts file to HDFS:
hdfs dfs -put /etc/hosts /tmp/
- Run the following command to submit a Hadoop Streaming job:
hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-X.X.X.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input /tmp/hosts -output /tmp/output
| Parameter | Description |
| --- | --- |
| input | The input path. In this example, the input path is /tmp/hosts. |
| output | The output path. In this example, the output path is /tmp/output. |
Note: In hadoop-streaming-X.X.X.jar, X.X.X indicates the version of the JAR package. The version of the JAR package must be the same as the Hadoop version of your cluster. You can view the version of the JAR package in the /usr/lib/hadoop-current/share/hadoop/tools/lib/ directory.
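Looking up the JAR version and filling it into the command can also be scripted. The following is a minimal sketch under stated assumptions: find_streaming_jars and build_command are hypothetical helpers, the glob pattern assumes that the file name starts with hadoop-streaming- as shown in this topic, and the pattern may match more than one file, so inspect the list before you use an entry. The script only prints the command and does not run it.

```python
import glob
import os

# Directory named in this topic. Adjust it if your cluster layout differs.
TOOLS_DIR = '/usr/lib/hadoop-current/share/hadoop/tools/lib'


def find_streaming_jars(tools_dir=TOOLS_DIR):
    """Return the sorted paths of hadoop-streaming JAR files in tools_dir."""
    pattern = os.path.join(tools_dir, 'hadoop-streaming-*.jar')
    return sorted(glob.glob(pattern))


def build_command(jar_path):
    """Build the argument list of the job command used in this topic."""
    return [
        'hadoop', 'jar', jar_path,
        '-file', '/home/hadoop/mapper.py', '-mapper', 'mapper.py',
        '-file', '/home/hadoop/reducer.py', '-reducer', 'reducer.py',
        '-input', '/tmp/hosts', '-output', '/tmp/output',
    ]


if __name__ == '__main__':
    jars = find_streaming_jars()
    if jars:
        # Print the command for review instead of executing it.
        print(' '.join(build_command(jars[0])))
    else:
        print('No hadoop-streaming JAR found in %s' % TOOLS_DIR)
```

Printing the command rather than executing it keeps the sketch safe to run anywhere. On the cluster, the printed line can be copied into the shell after the JAR path is verified.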