Object Storage Service: Use OSS-HDFS in EMR Hive or Spark

Last Updated: Sep 21, 2023

OSS-HDFS (JindoFS) is a storage service that supports cache-based acceleration and Ranger authentication. OSS-HDFS is available for E-MapReduce (EMR) clusters of V3.42 or a later minor version, and of V5.8.0 or a later minor version. Clusters that use OSS-HDFS as the backend storage deliver better performance in big data extract, transform, and load (ETL) scenarios and allow you to smoothly migrate data from HDFS to OSS-HDFS. This topic describes how to use OSS-HDFS in EMR Hive or Spark.

Prerequisites

  • An EMR cluster of V3.42 or a later minor version, or of V5.8.0 or a later minor version, is created.
  • OSS-HDFS is enabled for a bucket, and permissions to access OSS-HDFS are granted.

Background information

OSS-HDFS is a cloud-native data lake storage service that provides unified metadata management and is fully compatible with the HDFS API. OSS-HDFS also supports the Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in various data lake-based computing scenarios in the big data and AI fields. For more information, see Overview.

Procedure

Note This section describes how to use OSS-HDFS in EMR Hive. The procedure for EMR Spark is similar; a sketch of an equivalent Spark SQL session is provided after the procedure.
  1. Log on to the EMR cluster. For more information, see Log on to a cluster.
  2. Create a Hive table in a directory of OSS-HDFS.

    1. Run the following command to open the Hive CLI:
      hive
    2. Run the following command to create a database in a directory of OSS-HDFS:

      CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://<yourBucketName>.<yourBucketEndpoint>/<path>';
      Note
      • In the preceding command, dw is the database name, <path> is a path of your choice, and <yourBucketName>.<yourBucketEndpoint> is the domain name of the bucket for which OSS-HDFS is enabled.

      • In this example, the bucket domain name of OSS-HDFS is used as the prefix of the path. If you want to use only the bucket name to point to a directory in OSS-HDFS, you can specify a bucket-level endpoint or a global endpoint. For more information, see Appendix 1: Other methods used to configure the endpoint of OSS-HDFS. A worked example of this statement with sample values is provided after the procedure.
    3. Run the following command to use the new database:
      use dw;
    4. Run the following command to create a Hive table in the new database:
      CREATE TABLE IF NOT EXISTS employee(eid INT, name STRING, salary STRING, destination STRING)
      COMMENT 'Employee details';
  3. Insert data into the Hive table.
    Execute the following SQL statement to write data to the Hive table. The statement starts a job on the EMR cluster.
    INSERT INTO employee(eid, name, salary, destination) values(1, 'liu hua', '100.0', '');
  4. Verify the data in the Hive table.
    SELECT * FROM employee WHERE eid = 1;
    The returned information contains the inserted data.
    OK
    1       liu hua 100.0
    Time taken: 12.379 seconds, Fetched: 1 row(s)
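
The following is a minimal worked example of the CREATE DATABASE statement from step 2, with sample values filled in. The bucket name examplebucket, the region cn-hangzhou, and the path hive/warehouse are hypothetical placeholders, and the <region>.oss-dls.aliyuncs.com endpoint format is an assumption; replace these values with the actual domain name of the bucket for which OSS-HDFS is enabled.

  -- Hypothetical values: examplebucket, cn-hangzhou, and hive/warehouse.
  -- Replace them with your own bucket domain name and path.
  CREATE DATABASE IF NOT EXISTS dw
  LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/hive/warehouse';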
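
To further confirm that the data written in step 3 is stored in OSS-HDFS, you can list the table directory from the cluster shell. This is a minimal sketch that assumes the same hypothetical bucket, region, and path as in the preceding example, and that the table data resides in the employee subdirectory under the database location.

  # List the table directory in OSS-HDFS. All values below are placeholders.
  hadoop fs -ls oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/hive/warehouse/employee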
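
As mentioned in the note at the beginning of the procedure, you can run the same flow in EMR Spark. The following is a minimal sketch that assumes you open the Spark SQL CLI on the cluster by running the spark-sql command; the bucket domain name and path are the same hypothetical placeholders as above.

  -- Run these statements in the Spark SQL CLI (for example, started with the spark-sql command).
  -- examplebucket, cn-hangzhou, and hive/warehouse are placeholders; replace them with your own values.
  CREATE DATABASE IF NOT EXISTS dw LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/hive/warehouse';
  USE dw;
  CREATE TABLE IF NOT EXISTS employee(eid INT, name STRING, salary STRING, destination STRING) COMMENT 'Employee details';
  INSERT INTO employee VALUES (1, 'liu hua', '100.0', '');
  SELECT * FROM employee WHERE eid = 1;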