
Hadoop Streaming

Last Updated: Jan 09, 2018

This topic describes how to write a Hadoop Streaming job in Python.

The mapper code is as follows:

```python
#!/usr/bin/env python
import sys

# Read input lines from stdin and emit a tab-separated (word, 1) pair
# for every word, which Hadoop Streaming treats as key/value output.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
```
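To see what the mapper emits, you can exercise the same logic on a made-up sample line (the helper function and sample text below are illustrative only, not part of the job):

```python
import io

def run_mapper(stream):
    """Apply the mapper logic to a stream and collect the emitted pairs."""
    pairs = []
    for line in stream:
        line = line.strip()
        for word in line.split():
            # One (word, 1) pair per occurrence, tab-separated.
            pairs.append('%s\t%s' % (word, 1))
    return pairs

sample = io.StringIO("foo bar foo\n")
print('\n'.join(run_mapper(sample)))
# foo	1
# bar	1
# foo	1
```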

The reducer code is as follows:

```python
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Hadoop sorts mapper output by key before the reduce phase,
# so all counts for the same word arrive on consecutive lines.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not an integer.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
```
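You can verify the reducer logic the same way by feeding it input that is already sorted by key, as it would be after Hadoop's shuffle phase (the helper function and sample pairs below are illustrative only):

```python
import io

def run_reducer(stream):
    """Apply the reducer logic to key-sorted (word, count) lines."""
    results = []
    current_word, current_count = None, 0
    for line in stream:
        word, count = line.strip().split('\t', 1)
        try:
            count = int(count)
        except ValueError:
            continue  # skip malformed counts
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                results.append('%s\t%s' % (current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append('%s\t%s' % (current_word, current_count))
    return results

# Input sorted by key, as Hadoop delivers it to the reducer.
sorted_pairs = io.StringIO("bar\t1\nfoo\t1\nfoo\t1\n")
print('\n'.join(run_reducer(sorted_pairs)))
# bar	1
# foo	2
```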

Assuming that the mapper code is saved as /home/hadoop/mapper.py, the reducer code is saved as /home/hadoop/reducer.py, the input path is /tmp/input in HDFS, and the output path is /tmp/output in HDFS, run the following Hadoop command on the E-MapReduce cluster:

```shell
hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -file /home/hadoop/mapper.py -mapper mapper.py \
    -file /home/hadoop/reducer.py -reducer reducer.py \
    -input /tmp/input -output /tmp/output
```
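Before submitting to the cluster, you can sanity-check the whole map-sort-reduce flow end to end in plain Python. The sketch below uses made-up input and `sorted()` as a stand-in for Hadoop's shuffle/sort; the function names are illustrative, not part of the Streaming API:

```python
import io

def mapper(stream):
    # Same logic as mapper.py: one (word, 1) pair per word.
    for line in stream:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)

def reducer(lines):
    # Same logic as reducer.py: sum consecutive counts per key.
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)

# sorted() plays the role of Hadoop's shuffle/sort between the phases.
text = io.StringIO("the quick fox the fox\n")
for line in reducer(sorted(mapper(text))):
    print(line)
# fox	2
# quick	1
# the	2
```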
