All Products
Search
Document Center

E-MapReduce:Use SmartData in EMR 3.20.0 to 3.22.0 (3.22.0 excluded)

Last Updated:Mar 26, 2026

JindoFS is a cloud-native file system for E-MapReduce (EMR) that uses Object Storage Service (OSS) as its storage backend, combined with local disk caching and a dedicated metadata service. Use JindoFS when you need HDFS-level metadata performance and elastic storage that scales independently of your compute cluster.

Note

This topic covers EMR V3.20.0 to V3.22.0 (V3.22.0 excluded). For EMR V3.22.0 and later, see Use JindoFS in EMR V3.22.0 to V3.26.3.

How it works

signal_path

JindoFS uses two internal services to separate storage from metadata:

  • Storage Service writes data to OSS, ensuring high reliability through OSS's built-in redundancy. Frequently accessed data is also cached on local cluster disks to accelerate reads.

  • Namespace Service manages file metadata locally. Metadata queries go to Namespace Service instead of OSS, giving JindoFS metadata query performance similar to Hadoop Distributed File System (HDFS).

JindoFS supports two storage modes: block storage mode and cache mode. This topic covers block storage mode.

Choose a storage system

EMR provides three storage systems: OssFileSystem, HDFS, and JindoFS. OssFileSystem and JindoFS both use OSS as the storage backend.

Feature Hadoop OSS EMR OssFileSystem EMR HDFS EMR JindoFS
Storage capacity Tremendous Tremendous Depends on cluster scale Tremendous
Reliability High High High High
Throughput factor Server Disk I/O of caches Disk I/O Disk I/O
Metadata query efficiency Low Medium High High
Scale-out Easy Easy Easy Easy
Scale-in Easy Easy Node decommission required Easy
Data locality None Weak Strong Medium

When to use JindoFS (block storage mode):

  • High metadata query volume: JindoFS delivers HDFS-level metadata performance, making it suitable for workloads that list, stat, or rename files frequently.

  • Large-scale data with elastic clusters: Storage scales independently of the cluster. Scale the cluster in or out without decommissioning nodes or migrating data.

  • Write Once Read Many (WORM) workloads: Local disk caching accelerates repeated reads on a fixed dataset.

  • Data locality matters: JindoFS stores local backups on the nodes that originally wrote the data, reducing network I/O during subsequent reads.

Prerequisites

Before you begin, ensure that you have:

  • An EMR cluster running V3.20.0 to V3.22.0 (V3.22.0 excluded), with SmartData and Bigboot selected during cluster creation. For instructions, see Create a cluster.

  • An OSS bucket. Store the bucket in the same region as your EMR cluster to ensure high performance and stability, and to enable password-free access without configuring an AccessKey pair.

Note

Bigboot provides distributed data management and component management services. SmartData builds on Bigboot to expose JindoFS to the application layer.

Configure JindoFS

Two configuration methods are available. Use the post-creation method if the cluster already exists, or the pre-creation method to automate configuration at cluster startup.

Configure after cluster creation

  1. In the EMR console, go to the Bigboot configuration page for your cluster.

  2. Set the following parameters. Parameters framed in red in the console are required.

    Note

    JindoFS supports multiple namespaces. The examples in this topic use a namespace named test.

    Parameter Description Example
    oss.access.bucket Name of the OSS bucket used as the JindoFS storage backend. my-emr-bucket
    oss.data-dir Directory within the bucket where JindoFS stores data. Created automatically on first write — do not create it in advance. jindoFS-1
    oss.access.endpoint Endpoint of the region where the OSS bucket resides. oss-cn-hangzhou-internal.aliyuncs.com
    oss.access.key AccessKey ID for OSS access. Omit if the bucket is in the same region as the cluster (password-free access).
    oss.access.secret AccessKey secret for OSS access. Omit if the bucket is in the same region as the cluster (password-free access).

    config

  3. Save and deploy the configuration.

  4. Restart all SmartData components for the changes to take effect.

    service

Configure at cluster creation

Pass Bigboot parameters as custom configurations when creating the cluster. The cluster automatically applies the settings after startup.

  1. On the cluster creation page, enable Custom Software Settings.

  2. In the Advanced Settings section, add the following JSON. Replace the values with your OSS bucket name and data directory.

    [
        {
            "ServiceName": "BIGBOOT",
            "FileName": "bigboot",
            "ConfigKey": "oss.data-dir",
            "ConfigValue": "jindoFS-1"
        },
        {
            "ServiceName": "BIGBOOT",
            "FileName": "bigboot",
            "ConfigKey": "oss.access.bucket",
            "ConfigValue": "oss-bucket-name"
        }
    ]

    kerbernets

  3. Complete cluster creation. SmartData restarts automatically and applies the configuration.

Use JindoFS

JindoFS uses the jfs:// URI prefix, with the same command syntax as HDFS.

List a directory:

hadoop fs -ls jfs:///

Create a directory:

hadoop fs -mkdir jfs:///test-dir

Upload a file:

hadoop fs -put test.log jfs:///test-dir/

Data can be read from JindoFS only when Hadoop, Hive, and Spark jobs are running in the EMR cluster.

Manage disk space

JindoFS caches data on local disks to accelerate reads. Because local disk capacity is finite, JindoFS automatically evicts cold data using a high-watermark/low-watermark mechanism.

Configure the watermark parameters in Bigboot:

Parameter Description
node.data-dirs.watermark.high.ratio Upper limit of disk space usage per data disk (0–1). When usage reaches this ratio, JindoFS starts evicting the least recently accessed local blocks.
node.data-dirs.watermark.low.ratio Lower limit of disk space usage per data disk (0–1). Eviction continues until usage drops to this ratio.

By default, JindoFS uses the total capacity of all data disks. Set the high ratio above the low ratio — reversed values cause configuration errors.

Configure storage policies

JindoFS provides four storage policies that control how many copies of a file are kept in OSS and on local disks. The default policy is WARM.

Policy OSS backup Local backups Best for
COLD Yes None Archival data accessed rarely
WARM Yes 1 General-purpose workloads (default)
HOT Yes Multiple Frequently read hot data
TEMP No 1 Temporary data; faster I/O, lower reliability

New files inherit the storage policy of their parent directory.

Set a storage policy:

jindo dfsadmin -R -setStoragePolicy <path> <policy>
  • <path>: Target directory.

  • <policy>: One of COLD, WARM, HOT, or TEMP.

  • -R: Applies the policy recursively to all subdirectories.

Get the current storage policy:

jindo dfsadmin -getStoragePolicy <path>

Archive cold data (evict local blocks):

Use the archive command to explicitly evict local blocks while retaining the OSS copy. This is useful for partitioned tables where older partitions are no longer accessed frequently.

jindo dfsadmin -archive <path>

For example, if Hive partitions a table by day, run archive weekly on directories older than seven days to free local disk space without deleting data from OSS.

What's next