
Object Storage Service:Access OSS-HDFS from EMR Hive or Spark

Last Updated: Mar 20, 2026

OSS-HDFS (JindoFS) is a cloud-native data lake storage service that is fully compatible with the HDFS API and supports the Portable Operating System Interface (POSIX). It provides unified metadata management, cache-based acceleration, and Apache Ranger-based permission control for big data extract, transform, and load (ETL) workloads on E-MapReduce (EMR) clusters. OSS-HDFS also allows you to smoothly migrate data from HDFS to OSS-HDFS. This topic shows you how to create a Hive database and table backed by OSS-HDFS, insert data, and verify the result. The same steps apply to EMR Spark.

Supported EMR versions: EMR V3.42 or later minor versions, and EMR V5.8.0 or later minor versions.

Prerequisites

Before you begin, make sure you have:

  • OSS-HDFS enabled for your bucket and the required access permissions granted to a RAM role. See Enable OSS-HDFS and grant access permissions.

  • Permission to connect EMR clusters to OSS-HDFS. Alibaba Cloud accounts have this permission by default. If you are using a RAM user, grant the required permissions first. See Grant a RAM user permissions to connect EMR clusters to OSS-HDFS.

  • The OSS-HDFS bucket domain name. To obtain it, go to the OSS console, open your bucket, and on the Overview tab, find the Access Ports section. Copy the full domain name listed under HDFS Service. Use this value as <yourHdfsBucketDomain> in the commands below.

Note: All path examples in this topic use the full OSS-HDFS domain name as the path prefix (the oss://<yourHdfsBucketDomain>/<path> format). If you prefer to use only the bucket name, configure a bucket-level endpoint or a global endpoint. See Other ways to configure an endpoint.
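As a quick illustration of the path format, the following sketch assembles a full OSS-HDFS URI from example values. The bucket domain shown is hypothetical and only for illustration; substitute the HDFS Service domain copied from your own bucket's Overview tab.

```shell
# Hypothetical values for illustration only; replace with your own.
BUCKET_DOMAIN="examplebucket.cn-hangzhou.oss-dls.aliyuncs.com"  # HDFS Service domain from the OSS console (assumed example)
DB_PATH="warehouse/dw"                                          # path inside the bucket (assumed example)

# Full URI in the oss://<yourHdfsBucketDomain>/<path> format used throughout this topic
DB_LOCATION="oss://${BUCKET_DOMAIN}/${DB_PATH}"
echo "${DB_LOCATION}"
```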

Use OSS-HDFS in EMR Hive

Step 1: Log on to the EMR cluster

Log on to your EMR cluster. See Log on to a cluster.

Step 2: Create a Hive database and table backed by OSS-HDFS

  1. Open the Hive CLI:

       hive
  2. Create a database that stores its data in OSS-HDFS:

       CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://<yourHdfsBucketDomain>/<path>';
    Parameter               Description
    dw                      The database name. Customize as needed.
    <yourHdfsBucketDomain>  The OSS-HDFS bucket domain name from the Access Ports section of the OSS console.
    <path>                  The path in OSS-HDFS where the database is stored. Customize as needed.
  3. Switch to the new database:

       USE dw;
  4. Create a Hive table in the database:

       CREATE TABLE IF NOT EXISTS employee (
         eid         INT,
         name        STRING,
         salary      STRING,
         destination STRING
       )
       COMMENT 'Employee details';
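To confirm that the new table is actually backed by OSS-HDFS, you can inspect its storage location from the Hive CLI. For a managed table, the location is typically the database location followed by the table name:

```sql
-- Run in the Hive CLI. In the output, the Location field should start with
-- oss://<yourHdfsBucketDomain>/, confirming the table is stored in OSS-HDFS.
DESCRIBE FORMATTED employee;
```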

Step 3: Insert data

Run the following statement to insert a row. Hive launches a job on the EMR cluster that writes the data to OSS-HDFS.

INSERT INTO employee (eid, name, salary, destination)
VALUES (1, 'liu hua', '100.0', '');

Step 4: Verify the data

Query the table to confirm the insert succeeded:

SELECT * FROM employee WHERE eid = 1;

Expected output:

OK
1       liu hua 100.0
Time taken: 12.379 seconds, Fetched: 1 row(s)

The row you inserted is returned, confirming that Hive is reading and writing data through OSS-HDFS correctly.
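Because EMR Spark shares the Hive metastore, you can run the same verification from the Spark SQL CLI. A minimal sketch, assuming the spark-sql CLI is available on the cluster:

```sql
-- Run in the spark-sql CLI on the EMR cluster. Spark resolves the table
-- through the shared Hive metastore and reads the data directly from OSS-HDFS.
USE dw;
SELECT * FROM employee WHERE eid = 1;
```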

What's next