OSS-HDFS (JindoFS) is a cloud-native data lake storage service that is fully compatible with the HDFS API and supports the Portable Operating System Interface (POSIX). It provides unified metadata management, cache-based acceleration, and Apache Ranger-based permission management for big data extract, transform, and load (ETL) workloads on E-MapReduce (EMR) clusters. OSS-HDFS also allows you to smoothly migrate data from HDFS to OSS-HDFS. This topic shows you how to create a Hive database and table backed by OSS-HDFS, insert data, and verify the result. The same steps apply to EMR Spark.
Supported EMR versions: EMR V3.42 or later minor versions, and EMR V5.8.0 or later minor versions.
Prerequisites
Before you begin, make sure you have:
OSS-HDFS enabled for your bucket and the required access permissions granted to a RAM role. See Enable OSS-HDFS and grant access permissions.
Permission to connect EMR clusters to OSS-HDFS. Alibaba Cloud accounts have this permission by default. If you are using a RAM user, grant the required permissions first. See Grant a RAM user permissions to connect EMR clusters to OSS-HDFS.
The OSS-HDFS bucket domain name. To get it, go to the OSS console, open your bucket, and on the Overview tab, find the Access Ports section. Copy the full domain name listed under HDFS Service. Use this value as <yourHdfsBucketDomain> in the commands below.
Note: All path examples in this topic use the full OSS-HDFS domain name as the path prefix (the oss://<yourHdfsBucketDomain>/<path> format). If you prefer to use only the bucket name, configure a bucket-level endpoint or a global endpoint. See Other ways to configure an endpoint.
Use OSS-HDFS in EMR Hive
Step 1: Log on to the EMR cluster
Log on to your EMR cluster. See Log on to a cluster.
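Before you create any Hive objects, you can optionally confirm that the cluster can reach your OSS-HDFS bucket by listing its root path with the Hadoop client. This is a sketch; <yourHdfsBucketDomain> is the HDFS Service domain name described in the prerequisites.

```shell
# List the root of the OSS-HDFS bucket to confirm connectivity.
# Replace <yourHdfsBucketDomain> with your bucket's HDFS Service domain name.
hdfs dfs -ls oss://<yourHdfsBucketDomain>/
```

If the command returns a directory listing (or completes without output on an empty bucket) instead of an error, the cluster is configured to access OSS-HDFS.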
Step 2: Create a Hive database and table backed by OSS-HDFS
Open the Hive CLI:
hive
Create a database that stores its data in OSS-HDFS:
CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://<yourHdfsBucketDomain>/<path>';
The parameters in this statement are described below:
- dw: The database name. Customize as needed.
- <yourHdfsBucketDomain>: The OSS-HDFS bucket domain name from the Access Ports section of the OSS console.
- <path>: The path in OSS-HDFS where the database is stored. Customize as needed.
Switch to the new database:
USE dw;
Create a Hive table in the database:
CREATE TABLE IF NOT EXISTS employee (
  eid INT,
  name STRING,
  salary STRING,
  destination STRING
) COMMENT 'Employee details';
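To double-check that the new table actually resides in OSS-HDFS, you can inspect its metadata from the Hive CLI. This is a sketch; the exact output layout varies by Hive version.

```sql
-- Show the table's metadata, including its storage Location.
-- The Location value should start with oss://<yourHdfsBucketDomain>/.
DESCRIBE FORMATTED employee;
```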
Step 3: Insert data
Run the following statement to insert a row. This generates an EMR job to write the data to OSS-HDFS.
INSERT INTO employee (eid, name, salary, destination)
VALUES (1, 'liu hua', '100.0', '');
Step 4: Verify the data
Query the table to confirm the insert succeeded:
SELECT * FROM employee WHERE eid = 1;
Expected output:
OK
1 liu hua 100.0
Time taken: 12.379 seconds, Fetched: 1 row(s)
The row you inserted is returned, confirming that Hive is reading and writing data through OSS-HDFS correctly.
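Because the table data lives in OSS-HDFS, you can also verify the insert at the storage layer by listing the table directory. This is a sketch; <path> is the database location you chose in Step 2, and Hive stores a managed table's data in a subdirectory named after the table.

```shell
# List the files Hive wrote for the employee table in OSS-HDFS.
# Replace the placeholders with your bucket domain name and database path.
hdfs dfs -ls oss://<yourHdfsBucketDomain>/<path>/employee
```

The listing should contain one or more data files produced by the INSERT job.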
What's next
To learn more about OSS-HDFS and its capabilities, see What is OSS-HDFS?.
To configure alternative endpoint formats for accessing OSS-HDFS, see Other ways to configure an endpoint.