E-MapReduce: Use OSS-HDFS in EMR Hive or Spark

Last Updated: Jul 26, 2023

OSS-HDFS (JindoFS) is a storage service that supports cache-based acceleration and Ranger authentication. OSS-HDFS is available for clusters of E-MapReduce (EMR) V3.42 or a later minor version, and clusters of EMR V5.8.0 or a later minor version. Clusters that use OSS-HDFS as the backend storage provide better performance in big data extract, transform, and load (ETL) scenarios and allow you to smoothly migrate data from HDFS to OSS-HDFS. This topic describes how to use OSS-HDFS in EMR Hive or Spark.

Background information

OSS-HDFS is a cloud-native data lake storage service that provides unified metadata management capabilities and is fully compatible with the HDFS API. OSS-HDFS also supports Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in various data lake-based computing scenarios in the big data and AI fields. For more information, see Overview.
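
Because OSS-HDFS is fully compatible with the HDFS API, you can access a bucket for which OSS-HDFS is enabled from an EMR node by using standard HDFS shell commands. The following commands are a minimal sketch. The values of <yourBucketName> and <yourBucketEndpoint> are placeholders, and /tmp/example.txt is a hypothetical local file.

  # List the root directory of the bucket for which OSS-HDFS is enabled.
  hadoop fs -ls oss://<yourBucketName>.<yourBucketEndpoint>/
  # Create a directory and upload a local file to the directory.
  hadoop fs -mkdir oss://<yourBucketName>.<yourBucketEndpoint>/dw
  hadoop fs -put /tmp/example.txt oss://<yourBucketName>.<yourBucketEndpoint>/dw/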

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Procedure

  1. Step 1: Enable OSS-HDFS
  2. Step 2: Obtain the bucket domain name of OSS-HDFS
  3. Step 3: Use OSS-HDFS in the EMR cluster

Step 1: Enable OSS-HDFS

Enable OSS-HDFS and obtain the permissions to access OSS-HDFS. For more information, see Enable OSS-HDFS and grant access permissions.

Step 2: Obtain the bucket domain name of OSS-HDFS

On the Overview page of your bucket in the OSS console, obtain the OSS-HDFS domain name of the bucket, which is displayed as the HDFS endpoint of the bucket. The domain name is required when you create a Hive database and table in Step 3: Use OSS-HDFS in the EMR cluster.
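
For example, if the bucket is named examplebucket (a hypothetical name) and resides in the China (Hangzhou) region, the OSS-HDFS domain name is in the following format:

  examplebucket.cn-hangzhou.oss-dls.aliyuncs.com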

Step 3: Use OSS-HDFS in the EMR cluster

Note: This section describes how to use OSS-HDFS in EMR Hive. You can use OSS-HDFS in EMR Spark in a similar way. A minimal spark-sql sketch is provided after this procedure.
  1. Log on to the EMR cluster. For more information, see Log on to a cluster.
  2. Create a Hive table in a directory of OSS-HDFS.
    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to create a database in a directory of OSS-HDFS:
      CREATE DATABASE if not exists dw LOCATION 'oss://<yourBucketName>.<yourBucketEndpoint>/<path>';
      Note: Replace <yourBucketName> with the name of the bucket for which OSS-HDFS is enabled, <yourBucketEndpoint> with the OSS-HDFS domain name that is obtained in Step 2: Obtain the bucket domain name of OSS-HDFS, and <path> with the directory in which you want to create the database.
    3. Run the following command to use the new database:
      use dw;
    4. Run the following command to create a Hive table in the new database:
      CREATE TABLE IF NOT EXISTS employee(eid int, name String,salary String,destination String)
      COMMENT 'Employee details';
    5. Run the following command to query the information about the table:
      desc formatted employee;
      The following information is returned. The value of the Location parameter indicates that the Hive table is created in the directory of OSS-HDFS.
      # col_name              data_type               comment
      
      eid                     int
      name                    string
      salary                  string
      destination             string
      
      # Detailed Table Information
      Database:               dw
      Owner:                  root
      CreateTime:             Fri May 06 16:40:06 CST 2022
      LastAccessTime:         UNKNOWN
      Retention:              0
      Location:               oss://****.cn-hangzhou.oss-dls.aliyuncs.com/dw/employee
      Table Type:             MANAGED_TABLE
  3. Insert data into the Hive table.
    Execute the following SQL statement to write data to the Hive table. An EMR job is generated.
    INSERT INTO employee(eid, name, salary, destination) values(1, 'liu hua', '100.0', '');
  4. Verify the data in the Hive table.
    SELECT * FROM employee WHERE eid = 1;
    The returned information contains the inserted data.
    OK
    1       liu hua 100.0
    Time taken: 12.379 seconds, Fetched: 1 row(s)
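
The following commands are a minimal sketch of the same workflow in EMR Spark, assuming that Spark is deployed in the cluster and shares the Hive metastore with Hive. The inserted values, such as 'zhang wei', are hypothetical examples.

Open the Spark SQL CLI:

  spark-sql

Query and write data in the table that is stored in OSS-HDFS:

  -- Use the database that was created in the OSS-HDFS directory.
  USE dw;
  -- Read data from the table in OSS-HDFS.
  SELECT * FROM employee WHERE eid = 1;
  -- Write another row to the table in OSS-HDFS.
  INSERT INTO employee VALUES (2, 'zhang wei', '200.0', '');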

Grant access permissions to an EMR cluster

If the default role AliyunECSInstanceForEMRRole is not used by your EMR cluster, you must grant the EMR cluster the permissions to access OSS-HDFS.

If the default role AliyunECSInstanceForEMRRole is used by your EMR cluster, you do not need to grant the EMR cluster the permissions to access OSS-HDFS. By default, the policy AliyunECSInstanceForEMRRolePolicy is attached to the role, and the policy contains the oss:PostDataLakeStorageFileOperation permission.
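
The following policy is a minimal sketch of a custom RAM policy that contains this permission. The sketch is provided only for illustration: attach a policy of this kind to the RAM role that is used by the ECS instances of your cluster, and narrow the Resource element to your own bucket as needed.

  {
    "Version": "1",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "oss:PostDataLakeStorageFileOperation",
        "Resource": "*"
      }
    ]
  }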