
Object Storage Service:Access OSS-HDFS from EMR Hive or Spark

Last Updated:Oct 21, 2025

OSS-HDFS (JindoFS) is a storage service that supports cache-based acceleration and Ranger-based permission control. OSS-HDFS is available for clusters of the following versions: E-MapReduce (EMR) V3.42 or a later minor version and EMR V5.8.0 or a later minor version. Clusters that use OSS-HDFS as the backend storage deliver better performance in big data extract, transform, and load (ETL) scenarios and allow you to smoothly migrate data from HDFS to OSS-HDFS. This topic describes how to use OSS-HDFS in EMR Hive or Spark.

Prerequisites

  • OSS-HDFS is enabled for a bucket and permissions are granted to a RAM role to access OSS-HDFS. For more information, see Enable OSS-HDFS and grant access permissions.

  • By default, an Alibaba Cloud account has the permissions to connect EMR clusters to OSS-HDFS and perform common operations related to OSS-HDFS. If you want to use a RAM user to connect EMR clusters to OSS-HDFS, the RAM user must be granted the required permissions. For more information, see Grant a RAM user permissions to connect EMR clusters to OSS-HDFS.

Background information

OSS-HDFS is a cloud-native data lake storage service that provides unified metadata management and is fully compatible with the HDFS API. OSS-HDFS also supports the Portable Operating System Interface (POSIX) standard. You can use OSS-HDFS to manage data in various data lake-based computing scenarios in the big data and AI fields. For more information, see What is OSS-HDFS?.

Procedure

Note This section describes how to use OSS-HDFS in EMR Hive. The procedure for EMR Spark is similar.
  1. Log on to the EMR cluster. For more information, see Log on to a cluster.
  2. Create a Hive table that points to OSS-HDFS.

    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to create a database that points to OSS-HDFS:

      CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://{yourHdfsBucketDomain}/{path}';

      Parameter description:

      • dw: The database name. You can customize this name.

      • {yourHdfsBucketDomain}: The bucket domain name for the OSS-HDFS service.

        • To retrieve the domain name, log on to the OSS console. Navigate to the target bucket. On the Overview tab, in the Access Ports section, copy the full bucket domain name that corresponds to the HDFS Service.

      • {path}: The path in OSS-HDFS to store the database. You can customize this path.

      Note

      This example uses the OSS-HDFS domain name as the path prefix. If you want to use only the bucket name to point to OSS-HDFS, you can configure a bucket-level endpoint or a global endpoint. For more information, see Appendix 1: Other ways to configure an endpoint.

    3. Run the following command to use the new database:
      use dw;
    4. Run the following command to create a Hive table in the new database:
      CREATE TABLE IF NOT EXISTS employee(eid INT, name STRING, salary STRING, destination STRING)
      COMMENT 'Employee details';
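    In production, a table often points at an explicit OSS-HDFS path instead of inheriting the database location. The following is a minimal sketch, assuming the same `{yourHdfsBucketDomain}` and `{path}` placeholders used above; replace them with your own values:

    ```sql
    -- External table whose data lives at an explicit OSS-HDFS path.
    -- The bucket domain and path are placeholders, not real values.
    CREATE EXTERNAL TABLE IF NOT EXISTS employee_ext(
      eid INT,
      name STRING,
      salary STRING,
      destination STRING
    )
    COMMENT 'Employee details (external)'
    LOCATION 'oss://{yourHdfsBucketDomain}/{path}/employee_ext';
    ```

    Because the table is external, dropping it removes only the Hive metadata; the data files remain in OSS-HDFS.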
  3. Insert data into the Hive table.
    Execute the following SQL statement to write data to the Hive table. An EMR job is generated.
    INSERT INTO employee(eid, name, salary, destination) values(1, 'liu hua', '100.0', '');
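    Besides INSERT statements, you can load a data file that is already stored in OSS-HDFS into the table. The following is a sketch, assuming a hypothetical file path; the file must match the table's storage format and field delimiter:

    ```sql
    -- Moves the file at the given OSS-HDFS path into the table directory.
    -- The path is a placeholder, not a real value.
    LOAD DATA INPATH 'oss://{yourHdfsBucketDomain}/{path}/employee_data' INTO TABLE employee;
    ```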
  4. Verify the data in the Hive table.
    SELECT * FROM employee WHERE eid = 1;
    The returned information contains the inserted data.
    OK
    1       liu hua 100.0
    Time taken: 12.379 seconds, Fetched: 1 row(s)
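To confirm that the data files were written to OSS-HDFS, you can list the table directory from the Hive CLI (or run the equivalent `hadoop fs -ls` command from the shell). A sketch, assuming the database location used above:

```sql
-- Lists the files under the employee table directory in OSS-HDFS.
-- {yourHdfsBucketDomain} and {path} are the same placeholders as above.
dfs -ls oss://{yourHdfsBucketDomain}/{path}/employee;
```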