E-MapReduce: Use MapReduce to process data in JindoFS

MapReduce jobs on E-MapReduce (EMR) can read and write data directly in JindoFileSystem (JindoFS) by pointing the job's input and output directories to jfs:// paths. No code changes are required.

How JindoFS works with MapReduce

By default, Hadoop MapReduce reads and writes data through Hadoop Distributed File System (HDFS). JindoFS is fully compatible with the HDFS API, so the MapReduce framework accesses JindoFS paths the same way it accesses HDFS paths. Data written to a jfs:// path is stored in the OSS bucket that backs your JindoFS namespace rather than in the cluster's HDFS.

To redirect a job to JindoFS, replace the hdfs:// scheme and authority in the job's input and output paths with jfs://<namespace>/. Map tasks, Reduce tasks, scheduling, and the rerun of failed tasks all work without modification.
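The prefix replacement described above can be sketched as a small shell helper. This is an illustrative sketch only: the function name `to_jfs_path` and the NameNode address `emr-header-1:9000` are hypothetical and are not part of EMR or JindoFS.

```shell
# Rewrite an hdfs:// URI to the equivalent jfs:// URI (hypothetical helper).
# Usage: to_jfs_path <hdfs-uri> <namespace>
to_jfs_path() {
  uri=$1
  namespace=$2
  case "$uri" in
    hdfs://*)
      rest=${uri#hdfs://}          # strip the scheme, leaving "authority/path"
      case "$rest" in
        */*) path=${rest#*/} ;;    # drop the authority, keep everything after the first "/"
        *)   path="" ;;            # URI had no path component
      esac
      printf 'jfs://%s/%s\n' "$namespace" "$path"
      ;;
    *)
      printf '%s\n' "$uri"         # not an hdfs:// URI, return it unchanged
      ;;
  esac
}

# Example with an assumed NameNode address:
to_jfs_path hdfs://emr-header-1:9000/teragen_data_0 emr-jfs
```

Only the job's path arguments change; the job JAR and its other arguments stay the same.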

Prerequisites

Before you begin, make sure you have:

An EMR cluster with JindoFS configured

A JindoFS namespace with the following properties (the examples below use a namespace named emr-jfs):

jfs.namespaces=emr-jfs
jfs.namespaces.emr-jfs.oss.uri=oss://oss-bucket/oss-dir
jfs.namespaces.emr-jfs.mode=block
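Before submitting a job, it can help to confirm that the namespace resolves from the cluster. The sketch below lists the namespace root with the standard `hadoop fs -ls` command. The function name `check_jfs_namespace` and the `HADOOP` override variable are assumptions made for illustration.

```shell
# Command used to reach the file system. Override it for a dry run, e.g. HADOOP=echo.
HADOOP=${HADOOP:-hadoop}

# List the root of a JindoFS namespace to confirm that the jfs:// path resolves.
# Usage: check_jfs_namespace <namespace>
check_jfs_namespace() {
  "$HADOOP" fs -ls "jfs://$1/"
}

# On a cluster node you would call, for example:
#   check_jfs_namespace emr-jfs
```

If the listing fails, check the namespace properties above before running the pipeline.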

Run a MapReduce pipeline on JindoFS

The following example runs a complete Teragen and Terasort pipeline on JindoFS. Teragen generates a dataset and writes it to JindoFS; Terasort reads that output, sorts it, and writes the sorted result back to JindoFS.

Step 1: Generate data with Teragen

Run Teragen to generate 100,000 rows of data and write them to JindoFS:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 100000 jfs://emr-jfs/teragen_data_0

Argument	Description
`100000`	Number of rows to generate. Replace with your target row count.
`jfs://emr-jfs/teragen_data_0`	Output path in JindoFS. Replace `emr-jfs` with your namespace name and adjust the directory name as needed.
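When choosing a row count, it may help to know that Teragen writes fixed-size rows of 100 bytes each. The following sketch converts between rows and bytes; the function names are hypothetical.

```shell
# Teragen rows are 100 bytes each.
# Usage: teragen_bytes <rows>       prints the dataset size in bytes
teragen_bytes() {
  echo $(( $1 * 100 ))
}

# Usage: teragen_rows_for <bytes>   prints the number of whole rows that fit
teragen_rows_for() {
  echo $(( $1 / 100 ))
}

teragen_bytes 100000          # size of the dataset generated in step 1
teragen_rows_for 1073741824   # rows needed for roughly 1 GiB
```

With the row count used in step 1, the generated dataset is about 10 MB.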

Step 2: Sort data with Terasort

Run Terasort to read the generated data from JindoFS, sort it, and write the sorted output back to JindoFS:

hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort jfs://emr-jfs/teragen_data_0/ jfs://emr-jfs/terasort_data_0

Argument	Description
`jfs://emr-jfs/teragen_data_0/`	Input path: the directory written in step 1. Adjust the directory name if you changed it in step 1.
`jfs://emr-jfs/terasort_data_0`	Output path for the sorted data. Replace `emr-jfs` with your namespace name and adjust the directory name as needed.

Both paths resolve to your JindoFS namespace, so the data is stored in the OSS bucket configured for that namespace. The MapReduce framework reads and writes the data there transparently.
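The two steps can also be chained in one script. The sketch below wraps them in a hypothetical `run_tera_pipeline` function and adds an optional third stage, Teravalidate, which ships in the same Hadoop examples JAR and checks that the Terasort output is globally sorted. The report directory name `teravalidate_report_0` and the `HADOOP` and `EXAMPLES_JAR` override variables are assumptions made for illustration.

```shell
# Command used to submit jobs. Override it for a dry run, e.g. HADOOP=echo.
HADOOP=${HADOOP:-hadoop}
# Examples JAR path taken from the commands above; the shell expands the * glob at call time.
EXAMPLES_JAR=${EXAMPLES_JAR:-/usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar}

# Usage: run_tera_pipeline <namespace> <rows>
run_tera_pipeline() {
  namespace=$1
  rows=$2
  gen_dir="jfs://$namespace/teragen_data_0"
  sort_dir="jfs://$namespace/terasort_data_0"
  report_dir="jfs://$namespace/teravalidate_report_0"   # hypothetical directory name

  # $EXAMPLES_JAR is intentionally unquoted so that the glob expands.
  # Each stage runs only if the previous one succeeded.
  "$HADOOP" jar $EXAMPLES_JAR teragen "$rows" "$gen_dir" &&
  "$HADOOP" jar $EXAMPLES_JAR terasort "$gen_dir" "$sort_dir" &&
  "$HADOOP" jar $EXAMPLES_JAR teravalidate "$sort_dir" "$report_dir"
}

# On a cluster node you would call, for example:
#   run_tera_pipeline emr-jfs 100000
```

Setting `HADOOP=echo` prints the three commands without submitting any job, which is a convenient way to review the paths first.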