E-MapReduce:Use JindoFS in EMR V3.20.0 to V3.22.0 (V3.22.0 excluded) - E-MapReduce

JindoFS is a cloud-native file system for E-MapReduce (EMR) that uses Object Storage Service (OSS) as its storage backend, combined with local disk caching and a dedicated metadata service. Use JindoFS when you need HDFS-level metadata performance and elastic storage that scales independently of your compute cluster.

Note

This topic covers EMR V3.20.0 to V3.22.0 (V3.22.0 excluded). For EMR V3.22.0 and later, see Use JindoFS in EMR V3.22.0 to V3.26.3.

How it works

JindoFS uses two internal services to separate storage from metadata:

Storage Service writes data to OSS, ensuring high reliability through OSS's built-in redundancy. Frequently accessed data is also cached on local cluster disks to accelerate reads.
Namespace Service manages file metadata locally. Metadata queries go to Namespace Service instead of OSS, giving JindoFS metadata query performance similar to Hadoop Distributed File System (HDFS).

JindoFS supports two storage modes: block storage mode and cache mode. This topic covers block storage mode.

Choose a storage system

EMR provides three storage systems: OssFileSystem, HDFS, and JindoFS. OssFileSystem and JindoFS both use OSS as the storage backend.

Feature	Hadoop OSS	EMR OssFileSystem	EMR HDFS	EMR JindoFS
Storage capacity	Tremendous	Tremendous	Depends on cluster scale	Tremendous
Reliability	High	High	High	High
Throughput factor	Server	Disk I/O of caches	Disk I/O	Disk I/O
Metadata query efficiency	Low	Medium	High	High
Scale-out	Easy	Easy	Easy	Easy
Scale-in	Easy	Easy	Node decommission required	Easy
Data locality	None	Weak	Strong	Medium

When to use JindoFS (block storage mode):

High metadata query volume: JindoFS delivers HDFS-level metadata performance, making it suitable for workloads that list, stat, or rename files frequently.
Large-scale data with elastic clusters: Storage scales independently of the cluster. Scale the cluster in or out without decommissioning nodes or migrating data.
Write Once Read Many (WORM) workloads: Local disk caching accelerates repeated reads on a fixed dataset.
Data locality matters: JindoFS stores local backups on the nodes that originally wrote the data, reducing network I/O during subsequent reads.

Prerequisites

Before you begin, ensure that you have:

An EMR cluster running V3.20.0 to V3.22.0 (V3.22.0 excluded), with SmartData and Bigboot selected during cluster creation. For instructions, see Create a cluster.
An OSS bucket. Store the bucket in the same region as your EMR cluster to ensure high performance and stability, and to enable password-free access without configuring an AccessKey pair.

Note

Bigboot provides distributed data management and component management services. SmartData builds on Bigboot to expose JindoFS to the application layer.

Configure JindoFS

Two configuration methods are available. Use the post-creation method if the cluster already exists, or the pre-creation method to automate configuration at cluster startup.

Configure after cluster creation

In the EMR console, go to the Bigboot configuration page for your cluster.

Set the following parameters. Parameters framed in red in the console are required.

Note

JindoFS supports multiple namespaces. The examples in this topic use a namespace named test.

Parameter	Description	Example
`oss.access.bucket`	Name of the OSS bucket used as the JindoFS storage backend.	`my-emr-bucket`
`oss.data-dir`	Directory within the bucket where JindoFS stores data. Created automatically on first write — do not create it in advance.	`jindoFS-1`
`oss.access.endpoint`	Endpoint of the region where the OSS bucket resides.	`oss-cn-hangzhou-internal.aliyuncs.com`
`oss.access.key`	AccessKey ID for OSS access. Omit if the bucket is in the same region as the cluster (password-free access).	—
`oss.access.secret`	AccessKey secret for OSS access. Omit if the bucket is in the same region as the cluster (password-free access).	—

config

Save and deploy the configuration.
Restart all SmartData components for the changes to take effect.

Configure at cluster creation

Pass Bigboot parameters as custom configurations when creating the cluster. The cluster automatically applies the settings after startup.

On the cluster creation page, enable Custom Software Settings.

In the Advanced Settings section, add the following JSON. Replace the values with your OSS bucket name and data directory.

[
    {
        "ServiceName": "BIGBOOT",
        "FileName": "bigboot",
        "ConfigKey": "oss.data-dir",
        "ConfigValue": "jindoFS-1"
    },
    {
        "ServiceName": "BIGBOOT",
        "FileName": "bigboot",
        "ConfigKey": "oss.access.bucket",
        "ConfigValue": "oss-bucket-name"
    }
]

kerbernets

Complete cluster creation. SmartData restarts automatically and applies the configuration.

Use JindoFS

JindoFS uses the jfs:// URI prefix, with the same command syntax as HDFS.

List a directory:

hadoop fs -ls jfs:///

Create a directory:

hadoop fs -mkdir jfs:///test-dir

Upload a file:

hadoop fs -put test.log jfs:///test-dir/

Data can be read from JindoFS only when Hadoop, Hive, and Spark jobs are running in the EMR cluster.

Manage disk space

JindoFS caches data on local disks to accelerate reads. Because local disk capacity is finite, JindoFS automatically evicts cold data using a high-watermark/low-watermark mechanism.

Configure the watermark parameters in Bigboot:

Parameter	Description
`node.data-dirs.watermark.high.ratio`	Upper limit of disk space usage per data disk (0–1). When usage reaches this ratio, JindoFS starts evicting the least recently accessed local blocks.
`node.data-dirs.watermark.low.ratio`	Lower limit of disk space usage per data disk (0–1). Eviction continues until usage drops to this ratio.

By default, JindoFS uses the total capacity of all data disks. Set the high ratio above the low ratio — reversed values cause configuration errors.

Configure storage policies

JindoFS provides four storage policies that control how many copies of a file are kept in OSS and on local disks. The default policy is WARM.

Policy	OSS backup	Local backups	Best for
COLD	Yes	None	Archival data accessed rarely
WARM	Yes	1	General-purpose workloads (default)
HOT	Yes	Multiple	Frequently read hot data
TEMP	No	1	Temporary data; faster I/O, lower reliability

New files inherit the storage policy of their parent directory.

Set a storage policy:

jindo dfsadmin -R -setStoragePolicy <path> <policy>

<path>: Target directory.
<policy>: One of COLD, WARM, HOT, or TEMP.
-R: Applies the policy recursively to all subdirectories.

Get the current storage policy:

jindo dfsadmin -getStoragePolicy <path>

Archive cold data (evict local blocks):

Use the archive command to explicitly evict local blocks while retaining the OSS copy. This is useful for partitioned tables where older partitions are no longer accessed frequently.

jindo dfsadmin -archive <path>

For example, if Hive partitions a table by day, run archive weekly on directories older than seven days to free local disk space without deleting data from OSS.

What's next

Use JindoFS in EMR V3.22.0 to V3.26.3