Object Storage Service: Use OSS-HDFS in EMR Hive or Spark

Last Updated: Sep 21, 2023

OSS-HDFS (JindoFS) is a storage service that supports cache-based acceleration and Ranger authentication. OSS-HDFS is available for E-MapReduce (EMR) clusters of V3.42 or a later minor version, and of V5.8.0 or a later minor version. Clusters that use OSS-HDFS as the backend storage deliver better performance in big data extract, transform, and load (ETL) scenarios and allow you to smoothly migrate data from HDFS to OSS-HDFS. This topic describes how to use OSS-HDFS in EMR Hive or Spark.

Prerequisites

  • An EMR cluster of V3.42 or a later minor version, or of V5.8.0 or a later minor version, is created.
  • OSS-HDFS is enabled for a bucket, and permissions to access OSS-HDFS are granted.

Background information

OSS-HDFS is a cloud-native data lake storage service that provides unified metadata management and is fully compatible with the HDFS API. OSS-HDFS also supports the Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in various data lake-based computing scenarios in the big data and AI fields. For more information, see Overview.

Procedure

Note This section describes how to use OSS-HDFS in EMR Hive. The procedure for EMR Spark is similar; a sketch of an equivalent Spark SQL session is provided after the procedure.
  1. Log on to the EMR cluster. For more information, see Log on to a cluster.
  2. Create a Hive table in a directory of OSS-HDFS.

    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to create a database in a directory of OSS-HDFS:

      CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://<yourBucketName>.<yourBucketEndpoint>/<path>';
      Note
      • In the preceding command, dw is the database name, <path> is a path of your choice, and <yourBucketName>.<yourBucketEndpoint> is the domain name of the bucket for which OSS-HDFS is enabled.

      • In this example, the bucket domain name of OSS-HDFS is used as the prefix of the path. If you want to use only the bucket name to point to a directory in OSS-HDFS, you can specify a bucket-level endpoint or a global endpoint. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS. A worked example of this statement with sample values is provided after the procedure.
    3. Run the following command to use the new database:
      use dw;
    4. Run the following command to create a Hive table in the new database:
      CREATE TABLE IF NOT EXISTS employee(eid INT, name STRING, salary STRING, destination STRING)
      COMMENT 'Employee details';
  3. Insert data into the Hive table.
    Execute the following SQL statement to write data to the Hive table. The statement starts a job on the EMR cluster.
    INSERT INTO employee(eid, name, salary, destination) values(1, 'liu hua', '100.0', '');
  4. Verify the data in the Hive table.
    SELECT * FROM employee WHERE eid = 1;
    The returned information contains the inserted data.
    OK
    1       liu hua 100.0
    Time taken: 12.379 seconds, Fetched: 1 row(s)
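
The following is a minimal worked example of the CREATE DATABASE statement from step 2, with sample values filled in. The bucket name examplebucket, the region cn-hangzhou, and the path hive/warehouse are hypothetical placeholders, and the <region>.oss-dls.aliyuncs.com endpoint format is an assumption; replace these values with the actual domain name of the bucket for which OSS-HDFS is enabled.

  -- Hypothetical values: examplebucket, cn-hangzhou, and hive/warehouse.
  -- Replace them with your own bucket domain name and path.
  CREATE DATABASE IF NOT EXISTS dw
  LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/hive/warehouse';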
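
To further confirm that the data written in step 3 is stored in OSS-HDFS, you can list the table directory from the cluster shell. This is a minimal sketch that assumes the same hypothetical bucket, region, and path as in the preceding example, and that the table data resides in the employee subdirectory under the database location.

  # List the table directory in OSS-HDFS. All values below are placeholders.
  hadoop fs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/hive/warehouse/employee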
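
As mentioned in the note at the beginning of the procedure, you can run the same flow in EMR Spark. The following is a minimal sketch that assumes you open the Spark SQL CLI on the cluster by running the spark-sql command; the bucket domain name and path are the same hypothetical placeholders as above.

  -- Run these statements in the Spark SQL CLI (for example, started with the spark-sql command).
  -- examplebucket, cn-hangzhou, and hive/warehouse are placeholders; replace them with your own values.
  CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/hive/warehouse';
  USE dw;
  CREATE TABLE IF NOT EXISTS employee(eid INT, name STRING, salary STRING, destination STRING) COMMENT 'Employee details';
  INSERT INTO employee VALUES (1, 'liu hua', '100.0', '');
  SELECT * FROM employee WHERE eid = 1;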