All Products
Search
Document Center

Object Storage Service:Use Hive on an EMR cluster to process data stored in OSS-HDFS

Last Updated:Sep 28, 2023

This topic describes how to use Hive on an E-MapReduce (EMR) cluster to process data stored in OSS-HDFS.

Prerequisites

  • A cluster of EMR V3.42.0 or later, or EMR V5.8.0 or later is created, with the Hive service selected. For more information, see Create a cluster.

  • OSS-HDFS is enabled for a bucket and access permissions on OSS-HDFS are granted. For more information about how to enable OSS-HDFS, see Enable OSS-HDFS and grant access permissions.

Procedure

  1. Log on to the EMR cluster.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. Click the EMR cluster that you created.

    3. Click the Nodes tab, and then click p480359.jpg on the left side of the node group.

    4. Click the ID of the ECS instance. On the Instances page, click Connect next to the instance ID.

    For more information about how to log on to a cluster in Windows or Linux by using an SSH key pair or SSH password, see Log on to a cluster.

  2. Use the following command on the terminal to connect to Hive.

    For other connection methods, see Connect to Hive.

    hive
  3. Use the following command on the terminal to connect to Hive.

    For other connection methods, see Connect to Hive.

    hive
  4. Add partitions to an existing table.

    You can add partitions to an existing table to store the data of the table in smaller units. You can specify query conditions based on partitions. This way, only partitions that meet the specified conditions are scanned and the query performance is improved.

    • Command syntax

      ALTER TABLE <table_name> ADD [IF NOT EXISTS] PARTITION <pt_spec> [PARTITION <pt_spec> PARTITION <pt_spec>...] LOCATION 'location';

      The following table describes the parameters in the preceding command.

      Parameter

      Required

      Description

      table_name

      Yes

      The name of the table to which you want to add partitions.

      IF NOT EXISTS

      No

      The parameter that causes the error to be skipped if the table contains a partition with the same name. If the IF NOT EXISTS parameter is not specified and a partition whose name is the same as the partition that you want to add already exists, an operation failure occurs and an error is returned.

      pt_spec

      Yes

      The partitions that you want to add. The value of this parameter is in the (partition_col1 = partition_col_value1, partition_col2 = partition_col_value2, ...) format. In this format, partition_col is the names of partition key columns, and partition_col_value is their values. The names of partition key columns are case-insensitive and their values are case-sensitive.

      location

      Yes

      The OSS path that is used to store data in a partition.

    • Examples

      The following sample code provides an example on how to add a partition to a table named sale_detail to store the sale records in the China (Hangzhou) region in December 2021 and specify the OSS path that is used to store data in the partition:

      ALTER TABLE sale_detail ADD IF NOT EXISTS PARTITION (sale_date='202112', region='hangzhou') LOCATION 'oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/path/2021/';