This topic uses Hive on E-MapReduce as an example to describe how to create an external schema in MaxCompute and query Hive table data in Hadoop.
Usage notes
This feature is supported only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
Procedure
Step 1: Prerequisites
The MaxCompute project and the Hadoop cluster must be in the same region.
Enable schema support for your MaxCompute project.
Log on to the MaxCompute console and select a region in the top-left corner.
In the navigation pane on the left, click Projects.
On the Projects page, click Upgrade to Support Schemas in the Actions column of the target project.
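After the upgrade, schema syntax must still be enabled in each session before schema objects can be referenced, which is why the SQL in the later steps of this topic starts with the following flag:
SET odps.namespace.schema=true;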
Step 2: Prepare Hive data
Skip this step if you already have Hive data.
Log on to the E-MapReduce console and select a region in the upper-left corner.
On the My Clusters page, click Create Cluster.
Software Configuration
In the top menu bar, select a resource group. By default, all resources under your account are displayed. Configure the other parameters as follows.
Parameter
Description
Region
The cluster is created in the selected region. The region cannot be changed after the cluster is created.
The MaxCompute project and the Hadoop cluster must be in the same region.
Business Scenario
For this example, select Custom Cluster.
Product Version
Select a version. The latest version is recommended.
High Service Availability
This feature is disabled by default. If you enable High Service Availability, EMR distributes master nodes across different underlying hardware to reduce the risk of failures.
Optional Services (Select One At Least)
For this example, select HADOOP-COMMON (3.2.1), HDFS (3.2.1), HIVE (3.1.3), and YARN (3.2.1).
Collect Service Operational Logs
Log collection is enabled by default.
If you do not want to collect service operational logs, you can change the collection status on the Basic Information tab of the cluster details page.
Metadata
This example uses a Self-managed RDS instance.
For more information about how to create an RDS instance, see Create an RDS for MySQL instance and configure the database.
For an EMR cluster that is used in a production environment, we recommend a dedicated ApsaraDB RDS for MySQL instance, such as ApsaraDB RDS for MySQL 5.7 High-availability Edition.
Configure the RDS connection information in hivemetastore-site.xml:
javax.jdo.option.ConnectionURL: the Java Database Connectivity (JDBC) connection string for the metadata database. For example, jdbc:mysql://rds.host.name/hive_db_name.
javax.jdo.option.ConnectionUserName: the account of the metadata database.
javax.jdo.option.ConnectionPassword: the password of the metadata database.
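Assembled into hivemetastore-site.xml, the three items look like the following sketch. The host name, database name, account, and password are placeholders for your own RDS instance:
<!-- Sketch of hivemetastore-site.xml; all values are placeholders. -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://rds.host.name/hive_db_name</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>your_rds_account</value>
</property>
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>your_rds_password</value>
</property>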
The preceding configurations are for this example only. For more information about how to create a data lake analytics cluster in the E-MapReduce console, see Create and use a DataLake cluster.
Hardware Configuration
Parameter
Description
Billing Method
Valid values: Subscription and Pay-as-you-go.
For testing, use Pay-as-you-go. After testing is complete, you can release the cluster and create a Subscription cluster for production.
Zone
The zone cannot be changed after the cluster is created. Choose carefully.
VPC
Select a virtual private cloud (VPC) in the region. If none exists, click Create VPC. After you create a VPC, click Refresh to select it.
vSwitch
Select a vSwitch in the zone of the selected VPC. If no vSwitch is available in the zone, create one.
Default Security Group
If you have a security group in use, you can select it or create a new one.
Important: Advanced security groups created on ECS are not supported.
Node Group
Configure the master, core, or task node groups based on your business needs. For more information, see Select hardware and network.
For this example, enable Assign Public Network IP for the master node group and use the default values for other parameters.
Cluster Scaling
Configure scaling rules.
Basic Configuration
Parameter
Description
Cluster Name
The name of the cluster. The name must be 1 to 64 characters in length and can contain Chinese characters, letters, digits, hyphens (-), and underscores (_).
Identity Credentials
Key Pair: If you want to perform identity verification without entering a password, you can use a Key Pair. For more information, see Manage SSH key pairs.
Password: The password used to remotely log on to the master node of the cluster. Record this password. You will need it to log on to the cluster.
Log on to the master node of the cluster and prepare data in Hive.
Log on to the cluster master node
Log on to the E-MapReduce console and select a region in the upper-left corner.
Click Nodes for the target cluster.
On the Nodes tab, click the expand icon in the node group row, and then click the node ID to go to the Instance details page.
On the Instance page for the specified node, click Connect.
In the Remote connection dialog box, with the default Workbench option selected, click Sign in now.
In the logon window, enter the instance information as needed.
Note: When you create a key pair, the private key is automatically downloaded and saved as a .pem file. You can select Secure Shell (SSH) key authentication and upload the private key file to perform identity verification without a password.
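If you prefer a plain SSH terminal to Workbench, a logon with the downloaded private key looks like the following sketch. The key path and public IP address are placeholders:
# Restrict the key permissions, then log on to the master node.
chmod 600 ~/your_key.pem
ssh -i ~/your_key.pem root@<master-node-public-IP>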
Prepare Hive test data. After you log on to the cluster master node using Workbench, run the following commands in the terminal:
[user@emr-node ~]$ hive
-- Create a database.
hive> CREATE DATABASE IF NOT EXISTS myhive;
-- Switch to the database.
hive> USE myhive;
-- Create a non-partitioned table.
hive> CREATE TABLE IF NOT EXISTS employees (
          id INT,
          name STRING,
          age INT,
          department STRING
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
-- Insert data.
hive> INSERT INTO employees VALUES
      (1, 'John', 25, 'Sales'),
      (2, 'Jane', 30, 'Marketing'),
      (3, 'Mike', 35, 'Engineering'),
      (4, 'Sarah', 28, 'HR'),
      (5, 'David', 32, 'Finance'),
      (6, 'Linda', 29, 'IT'),
      (7, 'Robert', 31, 'Operations'),
      (8, 'Emily', 27, 'Research'),
      (9, 'Michael', 33, 'Development'),
      (10, 'Chris', 26, 'Support');
-- Create a partitioned table.
hive> CREATE TABLE employees_pt (
          id INT,
          name STRING,
          age INT
      )
      PARTITIONED BY (department STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Insert data.
hive> INSERT INTO employees_pt PARTITION (department='Sales') VALUES (1, 'John', 25), (2, 'Jane', 30), (3, 'Mike', 35);
hive> INSERT INTO employees_pt PARTITION (department='Marketing') VALUES (4, 'Sarah', 28), (5, 'David', 32);
hive> INSERT INTO employees_pt PARTITION (department='Engineering') VALUES (6, 'Linda', 29), (7, 'Robert', 31);
hive> INSERT INTO employees_pt PARTITION (department='HR') VALUES (8, 'Emily', 27), (9, 'Michael', 33), (10, 'Chris', 26);
Query the non-partitioned table and the partitioned table:
-- Query the non-partitioned table.
hive> SELECT * FROM employees;
1   John    25  Sales
2   Jane    30  Marketing
3   Mike    35  Engineering
4   Sarah   28  HR
5   David   32  Finance
6   Linda   29  IT
7   Robert  31  Operations
8   Emily   27  Research
9   Michael 33  Development
10  Chris   26  Support
-- Query the partitioned table.
hive> SELECT * FROM employees_pt;
6   Linda   29  Engineering
7   Robert  31  Engineering
8   Emily   27  HR
9   Michael 33  HR
10  Chris   26  HR
4   Sarah   28  Marketing
5   David   32  Marketing
1   John    25  Sales
2   Jane    30  Sales
3   Mike    35  Sales
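Because MaxCompute later reads these files directly from HDFS, you can optionally confirm where Hive stored them. A minimal check, assuming the default warehouse directory (verify yours with the hive.metastore.warehouse.dir property):
[user@emr-node ~]$ hadoop fs -ls /user/hive/warehouse/myhive.db/employees_pt
# One subdirectory per partition is expected, for example:
# .../employees_pt/department=Engineering
# .../employees_pt/department=HR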
Step 3: Create a Hive+HDFS external data source
Log on to the MaxCompute console and select a region in the top-left corner.
In the navigation pane on the left, click Foreign Server.
On the Foreign Server page, click Create Foreign Server.
In the Add Foreign Server dialog box, configure the parameters. The following table describes the parameters.
Parameter
Required
Description
Foreign Server Type
Required
Select Hive+HDFS.
Foreign Server Name
Required
Enter a custom name. The naming convention is as follows:
The name must start with a letter and can contain only lowercase letters, underscores (_), and digits.
The name cannot exceed 128 characters.
For example, hive_hdfs_mc.
Foreign Server Description
Optional
Enter a description as needed.
Network Connection Object
Required
The name of the network connection. Select or create a connection from MaxCompute to the Alibaba Cloud E-MapReduce or Hadoop VPC network.
Important: The VPC must be in the same region as the MaxCompute external data source and the project to which the external data source is attached.
For more information about the parameters, see the Create a network connection between MaxCompute and the target VPC network step in VPC access solution (direct connection).
Cluster Name
Required
The name used to refer to the NameNode in a high-availability (HA) Hadoop cluster.
Take an EMR cluster as an example. To obtain the cluster name:
Log on to the E-MapReduce console and select a region in the upper-left corner.
Click the ID of the target cluster to go to the cluster details page.
On the Services tab, click Configure for the HDFS service to go to the Configure page.
Switch to the hdfs-site.xml tab. In the By Name column, search for dfs.nameservices. The value of this configuration item is the Cluster Name.
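If you are already logged on to the master node, the same value can also be read with the HDFS client. A quick check, assuming the hdfs command is on the PATH (the sample output value is illustrative):
[user@emr-node ~]$ hdfs getconf -confKey dfs.nameservices
# Prints the cluster name, for example: emr-cluster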
NameNode Address
Required
The endpoints and port numbers of the active and standby NameNode services of the target Hadoop cluster. The port number is usually 8020.
Take an EMR cluster as an example. To obtain the NameNode address:
Log on to the E-MapReduce console and select a region in the upper-left corner.
Click the ID of the target cluster to go to the cluster details page.
On the Services tab, click Status for the HDFS service to go to the Status page.
In the Components area, click the expand icon next to NameNode to expand the topology list.
Obtain the Private IP Address of the master-1-1 node. The NameNode Address is in the format <Internal IP>:8020.
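On an HA cluster, you can also list both NameNode endpoints, together with their active and standby roles, from the master node. A sketch; the addresses in the sample output are hypothetical:
[user@emr-node ~]$ hdfs haadmin -getAllServiceState
# 192.168.0.10:8020    active
# 192.168.0.11:8020    standby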
HMS Service Address
Required
The endpoints and port numbers of the Hive metadata services of the active and standby NameNodes of the target Hadoop cluster. The port number is usually 9083.
Take an EMR cluster as an example. To obtain the HMS service address:
Log on to the E-MapReduce console and select a region in the upper-left corner.
Click the ID of the target cluster to go to the cluster details page.
On the Services tab, click Status for the Hive service to go to the Status page.
In the Components area, click the expand icon next to HiveRuntime to expand the topology list.
Obtain the Private IP Address of the master-1-1 node. The HMS Service Address is in the format <Internal IP>:9083.
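To confirm that the metastore is actually listening on that port, you can run a quick check from the master node. A sketch, assuming the ss utility is available:
[user@emr-node ~]$ ss -tln | grep 9083
# Expect a LISTEN entry on port 9083.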
Authentication Type
Required
Currently, only No Authentication is supported.
Create vSwitch
Required
MaxCompute accesses the data source through a VPC. By default, the reverse access 2.0 solution is used. This solution requires you to configure a vSwitch in a specific zone to establish the metadata access link.
The available zones for vSwitches in each region are described on the UI. In the VPC where the data source resides, select an existing vSwitch or create a new one that meets the zone requirements.
Click OK to create the external data source.
On the Foreign Server page, find the target data source and click Details in the Actions column.
Step 4: Create an external schema
Connect to your MaxCompute project.
Use the following code to create an external schema:
-- Enable schema syntax.
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS <YOUR_EXTERNAL_SCHEMA_NAME>
WITH <YOUR_EXTERNAL_DATASOURCE_NAME>
ON 'myhive';
Parameter description:
your_external_schema_name: the name of the external schema to create. For example, ex_hms_hdfs.
your_external_datasource_name: the name of the external data source created in MaxCompute in the previous step. The project to which the external schema belongs must be in the same region as the external data source. For example, hive_hdfs_mc.
myhive: the name of the Hive database created when you prepared the Hive data.
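With the example names used throughout this topic, the statement becomes:
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS ex_hms_hdfs
WITH hive_hdfs_mc
ON 'myhive';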
Step 5: Use SQL to access Hadoop Hive data
Query tables in the external schema.
SET odps.namespace.schema=true;
SHOW TABLES IN <YOUR_EXTERNAL_SCHEMA_NAME>;
-- The following result is returned:
ALIYUN$xxx:employees
ALIYUN$xxx:employees_pt
OK
Important: If the query fails, log on to the RDS console to check whether the VPC and security group of MaxCompute have been added to the RDS whitelist. It takes at least 5 minutes for the whitelist to take effect after the security group is added.
Query the details of the external schema table.
-- Query the non-partitioned table.
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees;
-- Result.
+------------+------------+------------+-------------+
| id         | name       | age        | department  |
+------------+------------+------------+-------------+
| 1          | John       | 25         | Sales       |
| 2          | Jane       | 30         | Marketing   |
| 3          | Mike       | 35         | Engineering |
| 4          | Sarah      | 28         | HR          |
| 5          | David      | 32         | Finance     |
| 6          | Linda      | 29         | IT          |
| 7          | Robert     | 31         | Operations  |
| 8          | Emily      | 27         | Research    |
| 9          | Michael    | 33         | Development |
| 10         | Chris      | 26         | Support     |
+------------+------------+------------+-------------+
-- Query the partitioned table.
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees_pt WHERE department='HR';
-- Result.
+------------+------------+------------+-------------+
| id         | name       | age        | department  |
+------------+------------+------------+-------------+
| 8          | Emily      | 27         | HR          |
| 9          | Michael    | 33         | HR          |
| 10         | Chris      | 26         | HR          |
+------------+------------+------------+-------------+
Step 6: Add new data to the Hadoop data source
Log on to the master node of the cluster created with EMR and insert new partition data into the Hive partitioned table:
hive> INSERT INTO employees_pt PARTITION (department='Computer') VALUES (11, 'Cily', 29), (12, 'Miky', 35);
-- Query the partitioned table on the Hive side.
hive> SELECT * FROM employees_pt;
-- Query result.
OK
11  Cily    29  Computer
12  Miky    35  Computer
6   Linda   29  Engineering
7   Robert  31  Engineering
8   Emily   27  HR
9   Michael 33  HR
10  Chris   26  HR
4   Sarah   28  Marketing
5   David   32  Marketing
1   John    25  Sales
2   Jane    30  Sales
3   Mike    35  Sales
Log on to the MaxCompute client and query the new partition data on the MaxCompute side:
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees_pt WHERE department='Computer';
-- Result.
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 11         | Cily       | 29         | Computer   |
| 12         | Miky       | 35         | Computer   |
+------------+------------+------------+------------+
Step 7: Copy data from the federated external table to the data warehouse
-- Copy data from the federated external table to the data warehouse.
CREATE TABLE employees_copy AS SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees;
-- Query the copied table data in the data warehouse.
SELECT * FROM employees_copy;
-- The following result is returned:
+------------+------------+------------+-------------+
| id | name | age | department |
+------------+------------+------------+-------------+
| 1 | John | 25 | Sales |
| 2 | Jane | 30 | Marketing |
| 3 | Mike | 35 | Engineering |
| 4 | Sarah | 28 | HR |
| 5 | David | 32 | Finance |
| 6 | Linda | 29 | IT |
| 7 | Robert | 31 | Operations |
| 8 | Emily | 27 | Research |
| 9 | Michael | 33 | Development |
| 10 | Chris | 26 | Support |
+------------+------------+------------+-------------+
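Note that CREATE TABLE ... AS SELECT produces a non-partitioned copy. If you also want a copy of the partitioned table that preserves its partitions, one sketch is to create the target table explicitly and load it with dynamic partitioning. employees_pt_copy is a hypothetical name, and the column definitions follow the examples above:
-- Create the target partitioned table in the data warehouse.
CREATE TABLE IF NOT EXISTS employees_pt_copy (
    id INT,
    name STRING,
    age INT
)
PARTITIONED BY (department STRING);
-- Load all partitions from the federated external table by using dynamic partitioning.
INSERT OVERWRITE TABLE employees_pt_copy PARTITION (department)
SELECT id, name, age, department FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees_pt;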