This topic describes how to use Python to submit a Hadoop Streaming job.

Prerequisites

An E-MapReduce (EMR) Hadoop cluster is created.

For more information about how to create a cluster, see Create a cluster.

Procedure

  1. Log on to the Hadoop cluster in SSH mode. For more information, see Connect to the master node of an EMR cluster in SSH mode.
  2. Create a file named mapper.py.
    1. Run the following command to create a file named mapper.py and open the file:
      vim /home/hadoop/mapper.py
    2. Press the I key to switch to the edit mode.
    3. Add the following information to the mapper.py file:
      #!/usr/bin/env python
      # Mapper: emit each word in the input with a count of 1.
      import sys

      for line in sys.stdin:
          line = line.strip()
          words = line.split()
          for word in words:
              print('%s\t%s' % (word, 1))
    4. Press Esc to exit the edit mode. Then, enter :wq to save and close the file.
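    You can verify the mapper logic locally before submitting the job. The following sketch (plain Python, no Hadoop required; the sample input string is illustrative) wraps the same tokenize-and-emit logic in a function:

```python
# Local sanity check: the same logic as mapper.py, as a function
# that can be run on a sample string instead of standard input.
def map_line(line):
    """Return a (word, 1) pair for each whitespace-separated word."""
    return [(word, 1) for word in line.strip().split()]

# Sample input, e.g. a line from /etc/hosts.
for word, count in map_line("127.0.0.1 localhost localhost"):
    print('%s\t%s' % (word, count))
```

The mapper does not aggregate anything; repeated words simply produce repeated `word\t1` lines, which the framework later sorts and hands to the reducer.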
  3. Create a file named reducer.py.
    1. Run the following command to create a file named reducer.py and open the file:
      vim /home/hadoop/reducer.py
    2. Press the I key to switch to the edit mode.
    3. Add the following information to the reducer.py file:
      #!/usr/bin/env python
      # Reducer: sum the counts emitted by the mapper. Hadoop Streaming
      # sorts the mapper output by key before it reaches the reducer,
      # so all counts for a word arrive on consecutive lines.
      import sys

      current_word = None
      current_count = 0

      for line in sys.stdin:
          line = line.strip()
          word, count = line.split('\t', 1)
          try:
              count = int(count)
          except ValueError:
              # Skip lines whose count is not a number.
              continue
          if current_word == word:
              current_count += count
          else:
              if current_word:
                  print('%s\t%s' % (current_word, current_count))
              current_count = count
              current_word = word

      # Emit the last word (guard against empty input).
      if current_word is not None:
          print('%s\t%s' % (current_word, current_count))
    4. Press Esc to exit the edit mode. Then, enter :wq to save and close the file.
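    Between the map and reduce phases, Hadoop Streaming sorts the mapper output by key. That shuffle-and-sort step can be simulated locally to confirm that the two scripts cooperate correctly. The sketch below (plain Python, no Hadoop required; the sample text is illustrative) reproduces the map, sort, and reduce flow in one function:

```python
def run_pipeline(text):
    """Simulate map -> sort -> reduce for the word-count scripts."""
    # Map: emit (word, 1) for every word, as mapper.py does.
    pairs = [(word, 1) for line in text.splitlines()
             for word in line.strip().split()]
    # Shuffle/sort: Hadoop sorts mapper output by key before reducing.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce: sum consecutive counts per word, as reducer.py does.
    result = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                result.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        result.append((current_word, current_count))
    return result

for word, count in run_pipeline("to be or not to be"):
    print('%s\t%s' % (word, count))
# Prints: be 2, not 1, or 1, to 2 (tab-separated, one pair per line)
```

The reducer relies entirely on the sort: it only compares each key with the previous one, so it would miscount if the input were not grouped by word.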
  4. Run the following command to upload the /etc/hosts file to HDFS as the input for the sample job:
    hdfs dfs -put /etc/hosts /tmp/
  5. Run the following command to submit a Hadoop Streaming job:
    hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-X.X.X.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input /tmp/hosts -output /tmp/output
    The following table describes the key parameters of the command.

    Parameter   Description
    input       The input path. In this example, the input path is /tmp/hosts.
    output      The output path. In this example, the output path is /tmp/output.
    Note In hadoop-streaming-X.X.X.jar, X.X.X indicates the version of the JAR package. The version of the JAR package must be the same as the Hadoop version of your cluster. You can view the version of the JAR package in the /usr/lib/hadoop-current/share/hadoop/tools/lib/ directory.