MapReduce jobs on E-MapReduce (EMR) can read and write data directly in JindoFileSystem (JindoFS) by pointing the job's input and output directories to jfs:// paths. No code changes are required.
How JindoFS works with MapReduce
By default, Hadoop MapReduce reads and writes data through Hadoop Distributed File System (HDFS). JindoFS is fully compatible with the HDFS API, so the MapReduce framework accesses JindoFS paths the same way it accesses HDFS paths. The data is stored in the OSS bucket backing your JindoFS namespace, not on the local file system.
To redirect a job to JindoFS, replace the hdfs:// path prefix with jfs://<namespace>/. All Map tasks, Reduce tasks, scheduling, and fault-tolerant rerun logic work without modification.
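The prefix substitution described above can be sketched as a small shell helper. This is only an illustration: the function name `to_jfs_path` is hypothetical and is not part of EMR or JindoFS, and the sketch assumes the input is an hdfs:// URI with or without a NameNode address.

```shell
# Hypothetical helper (not shipped with EMR or JindoFS): rewrite an hdfs:// URI
# so that it points at the same path inside a JindoFS namespace.
to_jfs_path() {
  # $1: JindoFS namespace name, $2: hdfs:// URI
  jfs_rest=${2#hdfs://}                  # drop the scheme
  case $jfs_rest in
    */*) jfs_rest=${jfs_rest#*/} ;;      # drop the authority (may be empty), keep the path
    *)   jfs_rest= ;;                    # the URI had no path component
  esac
  printf 'jfs://%s/%s\n' "$1" "$jfs_rest"
}

to_jfs_path emr-jfs hdfs:///user/hadoop/input                # -> jfs://emr-jfs/user/hadoop/input
to_jfs_path emr-jfs hdfs://namenode:8020/user/hadoop/input   # -> jfs://emr-jfs/user/hadoop/input
```

The helper only rewrites the string. It does not copy any data from HDFS to JindoFS.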
Prerequisites
Before you begin, make sure you have:
- An EMR cluster with JindoFS configured
- A JindoFS namespace with the following properties (the examples below use a namespace named emr-jfs):

      jfs.namespaces=emr-jfs
      jfs.namespaces.emr-jfs.oss.uri=oss://oss-bucket/oss-dir
      jfs.namespaces.emr-jfs.mode=block
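Before submitting a job, it can help to confirm that the namespace resolves from the node where you run Hadoop commands. The following is a minimal sketch that relies on the standard `hadoop fs -test -d` command. The function name `check_jfs_namespace` is hypothetical, and the sketch assumes that the root directory of a configured namespace is visible as a directory.

```shell
# Hypothetical helper: report whether the root of a JindoFS namespace is reachable.
check_jfs_namespace() {
  # $1: JindoFS namespace name, for example emr-jfs
  if hadoop fs -test -d "jfs://$1/"; then
    echo "namespace $1 is reachable"
  else
    echo "namespace $1 is not reachable" >&2
    return 1
  fi
}

# Example call (requires a configured cluster):
# check_jfs_namespace emr-jfs
```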
Run a MapReduce pipeline on JindoFS
The following example runs a complete Teragen and Terasort pipeline on JindoFS. Teragen generates a dataset and writes it to JindoFS; Terasort reads that output, sorts it, and writes the sorted result back to JindoFS.
Step 1: Generate data with Teragen
Run Teragen to generate 100,000 rows of data and write them to JindoFS:
hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar teragen 100000 jfs://emr-jfs/teragen_data_0
| Argument | Description |
|---|---|
| 100000 | Number of rows to generate. Replace with your target row count. |
| jfs://emr-jfs/teragen_data_0 | Output path in JindoFS. Replace emr-jfs with your namespace name and adjust the directory name as needed. |
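Teragen writes fixed-size rows of 100 bytes each, so the row count follows directly from the dataset size you want. The arithmetic can be sketched as below. The function name `teragen_rows_for_mb` is hypothetical, and it takes 1 MB to mean 1,000,000 bytes.

```shell
# Hypothetical helper: convert a target dataset size into a Teragen row count.
# Each Teragen row is 100 bytes, so rows = target bytes / 100.
teragen_rows_for_mb() {
  # $1: target dataset size in MB (1 MB = 1,000,000 bytes)
  echo $(( $1 * 1000000 / 100 ))
}

teragen_rows_for_mb 10   # -> 100000, the row count used in the command above
```

By the same calculation, the 100,000 rows in this example produce about 10 MB of data.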
Step 2: Sort data with Terasort
Run Terasort to read the generated data from JindoFS, sort it, and write the sorted output back to JindoFS:
hadoop jar /usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar terasort jfs://emr-jfs/teragen_data_0/ jfs://emr-jfs/terasort_data_0
| Argument | Description |
|---|---|
| jfs://emr-jfs/teragen_data_0/ | Input path: the directory written in step 1. If you changed the directory name in step 1, use the same name here. |
| jfs://emr-jfs/terasort_data_0 | Output path for the sorted data. Replace emr-jfs with your namespace name and adjust the directory name as needed. |
Both paths point to the OSS bucket configured in your JindoFS namespace. The MapReduce framework reads and writes data there transparently.
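The two steps can also be wrapped in one function so that the namespace and row count are set in a single place. The sketch below is an assumption-laden illustration rather than an official script: the names `run_jfs_terasort` and `EXAMPLES_JAR` and the teravalidate_report directory are hypothetical. It adds a Teravalidate run, which is another program in the same Hadoop examples JAR that checks whether the Terasort output is sorted, and then lists the output directory.

```shell
# Sketch only: function, variable, and report directory names are illustrative.
EXAMPLES_JAR=${EXAMPLES_JAR:-/usr/lib/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar}

run_jfs_terasort() {
  # $1: JindoFS namespace, $2: number of rows, $3: suffix for the directory names
  gen_dir="jfs://$1/teragen_data_$3"
  sort_dir="jfs://$1/terasort_data_$3"
  report_dir="jfs://$1/teravalidate_report_$3"

  # $EXAMPLES_JAR is left unquoted on purpose so the shell expands the * wildcard.
  # The && chain stops the pipeline at the first step that fails.
  hadoop jar $EXAMPLES_JAR teragen "$2" "$gen_dir" &&
  hadoop jar $EXAMPLES_JAR terasort "$gen_dir" "$sort_dir" &&
  hadoop jar $EXAMPLES_JAR teravalidate "$sort_dir" "$report_dir" &&
  hadoop fs -ls "$sort_dir"
}

# Example call, matching the commands in steps 1 and 2 (requires a configured cluster):
# run_jfs_terasort emr-jfs 100000 0
```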