E-MapReduce: Use OSS-HDFS in EMR Hive or Spark

Last Updated: Jul 26, 2023

OSS-HDFS (JindoFS) is a storage service that supports cache-based acceleration and Ranger authentication. OSS-HDFS is available for clusters of E-MapReduce (EMR) V3.42 or a later minor version, and clusters of EMR V5.8.0 or a later minor version. Clusters that use OSS-HDFS as the backend storage provide better performance in big data extract, transform, and load (ETL) scenarios and allow you to smoothly migrate data from HDFS to OSS-HDFS. This topic describes how to use OSS-HDFS in EMR Hive or Spark.

Background information

OSS-HDFS is a cloud-native data lake storage service that provides unified metadata management capabilities and is fully compatible with the HDFS API. OSS-HDFS also supports Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in various data lake-based computing scenarios in the big data and AI fields. For more information, see Overview.
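
Because OSS-HDFS is fully compatible with the HDFS API, you can access a bucket for which OSS-HDFS is enabled from an EMR node by using standard HDFS shell commands. The following commands are a minimal sketch. The values of <yourBucketName> and <yourBucketEndpoint> are placeholders, and /tmp/example.txt is a hypothetical local file.

  # List the root directory of the bucket for which OSS-HDFS is enabled.
  hadoop fs -ls oss://<yourBucketName>.<yourBucketEndpoint>/
  # Create a directory and upload a local file to the directory.
  hadoop fs -mkdir oss://<yourBucketName>.<yourBucketEndpoint>/dw
  hadoop fs -put /tmp/example.txt oss://<yourBucketName>.<yourBucketEndpoint>/dw/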

Prerequisites

An EMR cluster is created. For more information, see Create a cluster.

Procedure

  1. Step 1: Enable OSS-HDFS
  2. Step 2: Obtain the bucket domain name of OSS-HDFS
  3. Step 3: Use OSS-HDFS in the EMR cluster

Step 1: Enable OSS-HDFS

Enable OSS-HDFS and obtain the permissions to access OSS-HDFS. For more information, see Enable OSS-HDFS and grant access permissions.

Step 2: Obtain the bucket domain name of OSS-HDFS

On the Overview page of your bucket in the OSS console, obtain the OSS-HDFS domain name of the bucket, which is displayed as the HDFS endpoint of the bucket. The domain name is required when you create a Hive database and table in Step 3: Use OSS-HDFS in the EMR cluster.
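
For example, if the bucket is named examplebucket (a hypothetical name) and resides in the China (Hangzhou) region, the OSS-HDFS domain name is in the following format:

  examplebucket.cn-hangzhou.oss-dls.aliyuncs.com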

Step 3: Use OSS-HDFS in the EMR cluster

Note: This section describes how to use OSS-HDFS in EMR Hive. You can use OSS-HDFS in EMR Spark in a similar way. A minimal spark-sql sketch is provided after this procedure.
  1. Log on to the EMR cluster. For more information, see Log on to a cluster.
  2. Create a Hive table in a directory of OSS-HDFS.
    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to create a database in a directory of OSS-HDFS:
      CREATE DATABASE if not exists dw LOCATION 'oss://<yourBucketName>.<yourBucketEndpoint>/<path>';
      Note: Replace <yourBucketName> with the name of the bucket for which OSS-HDFS is enabled, <yourBucketEndpoint> with the OSS-HDFS domain name that is obtained in Step 2: Obtain the bucket domain name of OSS-HDFS, and <path> with the directory in which you want to create the database.
    3. Run the following command to use the new database:
      use dw;
    4. Run the following command to create a Hive table in the new database:
      CREATE TABLE IF NOT EXISTS employee(eid int, name String,salary String,destination String)
      COMMENT 'Employee details';
    5. Run the following command to query the information about the table:
      desc formatted employee;
      The following information is returned. The value of the Location parameter indicates that the Hive table is created in the directory of OSS-HDFS.
      # col_name              data_type               comment
      
      eid                     int
      name                    string
      salary                  string
      destination             string
      
      # Detailed Table Information
      Database:               dw
      Owner:                  root
      CreateTime:             Fri May 06 16:40:06 CST 2022
      LastAccessTime:         UNKNOWN
      Retention:              0
      Location:               oss://****.cn-hangzhou.oss-dls.aliyuncs.com/dw/employee
      Table Type:             MANAGED_TABLE
  3. Insert data into the Hive table.
    Execute the following SQL statement to write data to the Hive table. An EMR job is generated.
    INSERT INTO employee(eid, name, salary, destination) values(1, 'liu hua', '100.0', '');
  4. Verify the data in the Hive table.
    SELECT * FROM employee WHERE eid = 1;
    The returned information contains the inserted data.
    OK
    1       liu hua 100.0
    Time taken: 12.379 seconds, Fetched: 1 row(s)
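
The following commands are a minimal sketch of the same workflow in EMR Spark, assuming that Spark is deployed in the cluster and shares the Hive metastore with Hive. The inserted values, such as 'zhang wei', are hypothetical examples.

Open the Spark SQL CLI:

  spark-sql

Query and write data in the table that is stored in OSS-HDFS:

  -- Use the database that was created in the OSS-HDFS directory.
  USE dw;
  -- Read data from the table in OSS-HDFS.
  SELECT * FROM employee WHERE eid = 1;
  -- Write another row to the table in OSS-HDFS.
  INSERT INTO employee VALUES (2, 'zhang wei', '200.0', '');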

Grant access permissions to an EMR cluster

If the default role AliyunECSInstanceForEMRRole is not used by your EMR cluster, you must grant the EMR cluster the permissions to access OSS-HDFS.

If the default role AliyunECSInstanceForEMRRole is used by your EMR cluster, you do not need to grant the EMR cluster the permissions to access OSS-HDFS. By default, the policy AliyunECSInstanceForEMRRolePolicy is attached to the role, and the policy contains the oss:PostDataLakeStorageFileOperation permission.
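
The following policy is a minimal sketch of a custom RAM policy that contains this permission. The sketch is provided only for illustration: attach a policy of this kind to the RAM role that is used by the ECS instances of your cluster, and narrow the Resource element to your own bucket as needed.

  {
    "Version": "1",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": "oss:PostDataLakeStorageFileOperation",
        "Resource": "*"
      }
    ]
  }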