MaxCompute:Reading EMR Hive data based on HMS+HDFS

Last Updated: Jul 19, 2025

This topic describes how to create an external schema in MaxCompute and query Hive table data in E-MapReduce (EMR).

Prerequisites

You have created a MaxCompute project and enabled project-level metadata support for schemas for that project (the target project in which you will create the external schema).

Limits

  • The data lakehouse solution is supported only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.

  • The VPC where MaxCompute is deployed must be in the same region as the VPC where the EMR cluster is deployed.

Step 1: Prepare Hive data

  1. Log on to the EMR on ECS console, create an EMR cluster, and connect to the instance.

    1. When you create an EMR cluster, pay attention to the following configuration items:

      All of the following configuration items are in the Software Configuration section:

      • Business Scenario: Select a business scenario as needed. Example: Custom cluster.

      • Edition: Select an EMR version that is built based on Hadoop and Hive. Example: EMR-3.53.0.

      • Optional Services: Select the Hadoop, HDFS, Hive, YARN, and ZooKeeper components. The related service processes are started by default for the selected components. Example: Hadoop-Common, HDFS, Hive, YARN, Spark3, ZooKeeper.

      • Metadata: Select Self-managed RDS and configure the database connection parameters. For more information, see Configure self-managed RDS. Example: Self-managed RDS.

        Note: If your Hive+HDFS external data source is in an EMR cluster, you can use only a self-managed RDS instance as the metadatabase. Data Lake Formation (DLF) unified metadata and the built-in MySQL database are not supported. Before you read data, make sure that the Hive component in the EMR cluster is running properly.

    2. After the cluster is created, click Nodes in the Actions column of the cluster.

    3. On the Nodes tab, click the ID of the destination node in the emr-master node group to go to the Elastic Compute Service (ECS) console.

    4. Select a tool to connect to the ECS instance. For more information, see Connection methods.

      Note

      In this example, Workbench is used to connect to the instance. The logon password is the password that you configure when you create the cluster.

  2. Access Hive and prepare data. Log on to the primary node of the cluster by using Workbench. In the terminal, run the hive command to start the Hive CLI, and then run the following statements. An optional verification sketch follows the outputs below.

    1. Create a non-partitioned table and write data to the table.

      -- Create a database
      CREATE DATABASE IF NOT EXISTS myhive;
      
      -- Switch to the database
      USE myhive;
      
      -- Create a non-partitioned table
      CREATE TABLE IF NOT EXISTS employees (
        id INT,
        name STRING,
        age INT,
        department STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
      
      -- Insert data
      INSERT INTO employees VALUES(1, 'John', 25, 'Sales'),(2, 'Jane', 30, 'Marketing'),(3, 'Mike', 35, 'Engineering'),(4, 'Sarah', 28, 'HR'),(5, 'David', 32, 'Finance'),(6, 'Linda', 29, 'IT'),(7, 'Robert', 31, 'Operations'),(8, 'Emily', 27, 'Research'),(9, 'Michael', 33, 'Development'),(10, 'Chris', 26, 'Support');
      
      -- Query the table data
      SELECT * FROM employees;

      Output:

      1       John    25      Sales
      2       Jane    30      Marketing
      3       Mike    35      Engineering
      4       Sarah   28      HR
      5       David   32      Finance
      6       Linda   29      IT
      7       Robert  31      Operations
      8       Emily   27      Research
      9       Michael 33      Development
      10      Chris   26      Support
    2. Create a partitioned table and write data to the table.

      -- Create a partitioned table
      CREATE TABLE employees_pt (
          id INT,
          name STRING,
          age INT
      )
      PARTITIONED BY (department STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
      
      -- Insert data
      INSERT INTO employees_pt PARTITION (department='Sales') VALUES(1, 'John', 25),(2, 'Jane', 30),(3, 'Mike', 35);
      INSERT INTO employees_pt PARTITION (department='Marketing') VALUES(4, 'Sarah', 28),(5, 'David', 32);
      INSERT INTO employees_pt PARTITION (department='Engineering') VALUES(6, 'Linda', 29),(7, 'Robert', 31);
      INSERT INTO employees_pt PARTITION (department='HR') VALUES(8, 'Emily', 27),(9, 'Michael', 33),(10, 'Chris', 26);
      
      -- Query the table data
      SELECT * FROM employees_pt;

      Output:

      6       Linda   29      Engineering
      7       Robert  31      Engineering
      8       Emily   27      HR
      9       Michael 33      HR
      10      Chris   26      HR
      4       Sarah   28      Marketing
      5       David   32      Marketing
      1       John    25      Sales
      2       Jane    30      Sales
      3       Mike    35      Sales
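
    3. Optional: Verify the Hive table layout.

      The following is a minimal verification sketch that uses standard Hive statements. The HDFS path in the comment is only an illustrative placeholder; the actual Location value depends on your cluster.

      -- List the partitions that Hive created for the partitioned table
      SHOW PARTITIONS employees_pt;

      -- Show the storage details of the non-partitioned table. The Location field
      -- (for example, hdfs://<your-nameservice>/user/hive/warehouse/myhive.db/employees)
      -- indicates the HDFS nameservice that MaxCompute must be able to reach in Step 2.
      DESCRIBE FORMATTED employees;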

Step 2: Create a Hive+HDFS external data source in MaxCompute

  1. Log on to the MaxCompute console and select a region in the upper-left corner.

  2. In the navigation pane on the left, choose Tenant Management > Foreign Server.

  3. On the Foreign Server page, click Create Foreign Server.

  4. In the Create Foreign Server dialog box, configure the following parameters:

    • Foreign Server Type: Select Hive+HDFS. Example: Hive+HDFS.

    • Foreign Server Name: A custom name for the foreign server. The naming conventions are as follows:

      • The name can contain only lowercase letters, underscores (_), and digits, and must start with a lowercase letter.

      • The name can be up to 128 characters in length.

      Example: hive_fs.

    • Foreign Server Description: Enter the information as needed. Example: none.

    • Network Connection Object: The name of the network connection between MaxCompute and the VPC in which the EMR or Hadoop cluster is deployed. You can select an existing connection or create one. For more information about the parameters, see the "Create a network connection between MaxCompute and the target VPC" section in Access a VPC (direct connection). Example: networklink.

      Note

      • For more information about the basic concepts of network connections, see Networklink.

      • The VPC must be deployed in the same region as the MaxCompute foreign server and the project to which the foreign server is mounted.

    • Cluster Name: The name of the cluster. For a high-availability (HA) Hadoop cluster, this is the nameservice name, which is the value of the dfs.nameservices configuration item. To obtain the cluster name, perform the following operations:

      1. Log on to the EMR console and click the cluster ID to go to the cluster details page.

      2. On the Cluster Services tab, click Configure for the HDFS service to go to the Configuration page.

      3. Click the hdfs-site.xml tab. In the Configuration Item Name column, search for dfs.nameservices. The value of this configuration item is the cluster name.

      Example: hdfs-cluster.

    • NameNode Address: The IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. To obtain the NameNode address, perform the following operations:

      1. Log on to the EMR console and click the cluster ID to go to the cluster details page.

      2. On the Cluster Services tab, click HDFS to go to the Status page.

      3. In the Component List section, expand the topology list next to NameNode.

      4. Obtain the internal IP address of the Master-1-1 node. The NameNode address is in the Internal IP:8020 format.

      Example: 172.22.xx.xxx:8020.

    • HMS Service Address: The IP addresses and port numbers of the Hive Metastore Service (HMS) processes in the Hadoop cluster. In most cases, the port number is 9083. The format is Internal IP:9083. Example: 172.22.xx.xxx:9083.

      Note: For information about how to obtain the internal IP address, see the description of the NameNode Address parameter.

    • Authentication Type: Only No Authentication is supported. Example: No Authentication.

  5. Click OK to create the foreign server.

Step 3: Create an external schema in MaxCompute

Run the following command in the MaxCompute client to create an external schema:

Note

You must enable the project-level metadata support for schemas for the target MaxCompute project in advance.

SET odps.namespace.schema=true;

CREATE EXTERNAL SCHEMA IF NOT EXISTS es_hive
WITH hive_fs
ON 'myhive';

Code description:

  • es_hive: the name of the external schema. You can customize the name.

  • hive_fs: the name of the foreign server that you created in Step 2. The project to which the external schema belongs must be in the same region as the foreign server.

  • myhive: the name of the Hive database that you created in Step 1.
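
To confirm that the external schema is created, you can list the schemas in the current project. The following is a minimal sketch; it assumes that the SHOW SCHEMAS statement is available because project-level schema support is enabled.

SET odps.namespace.schema=true;

-- List the schemas in the current project; es_hive is expected to appear in the result
SHOW SCHEMAS;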

Step 4: Use SQL to access Hadoop Hive data in the MaxCompute client

  1. Query the tables in the external schema in the MaxCompute client.

    SET odps.namespace.schema=true;
    SHOW TABLES IN es_hive;

    Output:

    ALIYUN$xxx:employees
    ALIYUN$xxx:employees_pt
  2. Query the details of the tables in the external schema.

    • Query the non-partitioned table.

      SELECT * FROM es_hive.employees;

      Output:

      +------------+------------+------------+------------+
      | id         | name       | age        | department |
      +------------+------------+------------+------------+
      | 1          | John       | 25         | Sales      |
      | 2          | Jane       | 30         | Marketing  |
      | 3          | Mike       | 35         | Engineering|
      | 4          | Sarah      | 28         | HR         |
      | 5          | David      | 32         | Finance    |
      | 6          | Linda      | 29         | IT         |
      | 7          | Robert     | 31         | Operations |
      | 8          | Emily      | 27         | Research   |
      | 9          | Michael    | 33         | Development|
      | 10         | Chris      | 26         | Support    |
      +------------+------------+------------+------------+
    • Query the partitioned table.

      SELECT * FROM  es_hive.employees_pt WHERE department='HR';

      Output:

      +------------+------------+------------+------------+
      | id         | name       | age        | department |
      +------------+------------+------------+------------+
      | 8          | Emily      | 27         | HR         |
      | 9          | Michael    | 33         | HR         |
      | 10         | Chris      | 26         | HR         |
      +------------+------------+------------+------------+
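
  3. Optional: Run an aggregate query on the tables in the external schema.

    The external tables can be used in regular SQL statements, such as aggregations or joins with internal tables. The following is a minimal sketch based on the sample data from Step 1.

    -- Count the employees in each department of the Hive partitioned table
    SELECT department, COUNT(*) AS cnt
    FROM es_hive.employees_pt
    GROUP BY department;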

Step 5: Add new data to the Hadoop data source

  1. Insert new partition data into the Hive partitioned table.

    Log on to the primary node of the cluster by using Workbench. In the terminal, run the hive command to start the Hive CLI, and then run the following statements:

    -- Switch to the created database and insert data into the employees_pt table
    USE myhive;
    INSERT INTO employees_pt PARTITION (department='Computer') VALUES(11, 'Cily', 29),(12, 'Miky', 35);
    
    -- Query the partitioned table in Hive
    SELECT * FROM employees_pt;

    Output:

    11      Cily    29      Computer
    12      Miky    35      Computer
    6       Linda   29      Engineering
    7       Robert  31      Engineering
    8       Emily   27      HR
    9       Michael 33      HR
    10      Chris   26      HR
    4       Sarah   28      Marketing
    5       David   32      Marketing
    1       John    25      Sales
    2       Jane    30      Sales
    3       Mike    35      Sales
  2. Query the new partition data in the MaxCompute client.

    SELECT * FROM es_hive.employees_pt WHERE department='Computer';

    Output:

    +------+------+------+------------+
    | id   | name | age  | department |
    +------+------+------+------------+
    | 11   | Cily | 29   | Computer   |
    | 12   | Miky | 35   | Computer   |
    +------+------+------+------------+
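
  3. Optional: Confirm that the new partition is visible without a refresh operation.

    Because the new partition was queried in the previous substep without any import or metadata refresh, a simple row count also reflects the change immediately. The following is a minimal sketch; with the sample data, 12 rows are expected.

    -- Count all rows of the partitioned table, including the new Computer partition
    SELECT COUNT(*) AS total_rows FROM es_hive.employees_pt;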

Step 6: Copy the data of the federated external table to the data warehouse in the MaxCompute client

-- Copy the data of the federated external table to the data warehouse
CREATE TABLE employees_copy AS SELECT * FROM es_hive.employees;

-- Query the copied data in the data warehouse table
SELECT * FROM employees_copy;

Output:

+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      |
| 2          | Jane       | 30         | Marketing  |
| 3          | Mike       | 35         | Engineering|
| 4          | Sarah      | 28         | HR         |
| 5          | David      | 32         | Finance    |
| 6          | Linda      | 29         | IT         |
| 7          | Robert     | 31         | Operations |
| 8          | Emily      | 27         | Research   |
| 9          | Michael    | 33         | Development|
| 10         | Chris      | 26         | Support    |
+------------+------------+------------+------------+
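
The same pattern applies to the partitioned table. The following is a minimal sketch that copies a single Hive partition into an internal table; the table name employees_pt_sales_copy is chosen only for illustration, and in a CREATE TABLE ... AS SELECT statement the partition column of the source becomes a regular column of the new table.

-- Copy only the Sales partition of the federated external table to the data warehouse
CREATE TABLE employees_pt_sales_copy AS
SELECT id, name, age, department
FROM es_hive.employees_pt
WHERE department = 'Sales';

-- Query the copied data
SELECT * FROM employees_pt_sales_copy;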