E-MapReduce: Use OSS-HDFS in EMR Hive or Spark

Last Updated: Mar 26, 2026

OSS-HDFS (JindoFS) is a cloud-native data lake storage service that is fully compatible with the Hadoop Distributed File System (HDFS) API. It supports cache-based acceleration, Ranger authentication, and POSIX. Use it as the backend storage for your E-MapReduce (EMR) Hive or Spark workloads to improve performance in big data extract, transform, and load (ETL) scenarios and to smoothly migrate data from HDFS to OSS-HDFS.

OSS-HDFS is available on EMR V3.42 and later minor versions, and on EMR V5.8.0 and later minor versions.

Prerequisites

Before you begin, make sure you have:

Access control

OSS-HDFS access from an EMR cluster is controlled through RAM roles. The default role AliyunECSInstanceForEMRRole already has the oss:PostDataLakeStorageFileOperation permission through its attached policy AliyunECSInstanceForEMRRolePolicy — no additional configuration is needed.

If your cluster uses a custom role instead of the default role, grant the custom role the oss:PostDataLakeStorageFileOperation permission before proceeding.

OSS-HDFS endpoint format

All Hive and Spark commands that reference OSS-HDFS use the following URI format:

oss://<bucket-name>.<oss-hdfs-endpoint>/<path>

For example:

oss://my-bucket.cn-hangzhou.oss-dls.aliyuncs.com/warehouse/

Get the endpoint from the Overview page of your bucket in the OSS console.

Note: To use the bucket name alone, without the full endpoint, configure a bucket-level or global endpoint instead. See Appendix 1: Other methods used to configure the endpoint of OSS-HDFS.
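
As a rough illustration of the URI format described in this section, the following Python sketch assembles an OSS-HDFS URI from its three parts. The function name oss_hdfs_uri and its input checks are assumptions made for this example; they are not part of any EMR or OSS SDK and do not reproduce the official OSS naming rules.

```python
def oss_hdfs_uri(bucket: str, endpoint: str, path: str = "") -> str:
    """Return oss://<bucket-name>.<oss-hdfs-endpoint>/<path> (hypothetical helper)."""
    # Illustrative checks only; these are assumptions, not official OSS naming rules.
    if not bucket or "/" in bucket or "." in bucket:
        raise ValueError(f"unexpected bucket name: {bucket!r}")
    if not endpoint or "/" in endpoint:
        raise ValueError(f"unexpected endpoint: {endpoint!r}")
    # Drop any leading slash so the path is not separated from the host by '//'.
    return f"oss://{bucket}.{endpoint}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(oss_hdfs_uri("my-bucket", "cn-hangzhou.oss-dls.aliyuncs.com", "warehouse/"))
```

Called with the bucket, endpoint, and path from the example above, the helper returns oss://my-bucket.cn-hangzhou.oss-dls.aliyuncs.com/warehouse/.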

Use OSS-HDFS in Hive

The following steps show how to create a Hive database and table backed by OSS-HDFS, insert data, and verify the result. The same URI format applies to Spark: in any Spark job that reads from or writes to OSS-HDFS, replace <bucket-name> and <oss-hdfs-endpoint> with your actual bucket name and endpoint.

  1. Log on to the EMR cluster. See Log on to a cluster.

  2. Open the Hive CLI:

    hive
  3. Create a database in OSS-HDFS:

    CREATE DATABASE IF NOT EXISTS dw
    LOCATION 'oss://<bucket-name>.<oss-hdfs-endpoint>/<path>';

    Replace the following placeholders:

    Placeholder          Description
    <bucket-name>        Name of your OSS bucket
    <oss-hdfs-endpoint>  OSS-HDFS endpoint obtained from the OSS console, for example cn-hangzhou.oss-dls.aliyuncs.com
    <path>               Directory path within the bucket
  4. Switch to the new database:

    USE dw;
  5. Create a Hive table in the database:

    CREATE TABLE IF NOT EXISTS employee (
      eid         INT,
      name        STRING,
      salary      STRING,
      destination STRING
    )
    COMMENT 'Employee details';
  6. Verify that the table location points to OSS-HDFS:

    DESC FORMATTED employee;

    The Location field in the output confirms that the table resides in OSS-HDFS:

    Location:   oss://****.cn-hangzhou.oss-dls.aliyuncs.com/dw/employee
    Table Type: MANAGED_TABLE
  7. Insert a row:

    INSERT INTO employee (eid, name, salary, destination)
    VALUES (1, 'liu hua', '100.0', '');

    EMR generates a job to execute the insert.

  8. Query the inserted data:

    SELECT * FROM employee WHERE eid = 1;

    Expected output:

    OK
    1       liu hua 100.0
    Time taken: 12.379 seconds, Fetched: 1 row(s)
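
For Spark, a minimal PySpark sketch along the same lines might look like the following. It assumes the dw.employee table from the steps above already exists, that the script is submitted with spark-submit on the EMR cluster, and that the cluster's Spark can reach the Hive metastore. The script name oss_hdfs_demo.py, the helper export_uri, and the employee_export output directory are hypothetical names chosen for illustration.

```python
import sys


def export_uri(bucket: str, endpoint: str, path: str) -> str:
    """Build the OSS-HDFS URI of a hypothetical export directory."""
    parts = [p for p in (path.strip("/"), "employee_export") if p]
    return f"oss://{bucket}.{endpoint}/" + "/".join(parts)


def main(bucket: str, endpoint: str, path: str) -> None:
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("oss-hdfs-demo")
        .enableHiveSupport()  # lets Spark see the Hive database created above
        .getOrCreate()
    )

    # Read the Hive table whose data resides in OSS-HDFS.
    employees = spark.table("dw.employee")
    employees.show()

    # Write a Parquet copy to another OSS-HDFS path, then read it back.
    target = export_uri(bucket, endpoint, path)
    employees.write.mode("overwrite").parquet(target)
    spark.read.parquet(target).show()

    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 4:
        main(*sys.argv[1:4])
    else:
        print("usage: spark-submit oss_hdfs_demo.py "
              "<bucket-name> <oss-hdfs-endpoint> <path>")
```

The URI is passed to Spark exactly as it is passed to Hive; only the bucket name, endpoint, and path change between environments.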

What's next