This tutorial walks you through querying Hive table data in a Hadoop cluster directly from MaxCompute — without moving the data. It uses Hive on E-MapReduce (EMR) as the example environment.
MaxCompute connects to your Hadoop cluster over a Virtual Private Cloud (VPC) network, reads Hive table metadata through the Hive Metastore Service (HMS), and accesses the underlying data files directly through the Hadoop Distributed File System (HDFS). Once you map a Hive database to an external schema in MaxCompute, you can query Hive tables using standard SQL — including partitioned tables and newly added partitions — without importing the data.
Supported regions
China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt).
Prerequisites
Before you begin, make sure you have:
An activated E-MapReduce service and an EMR cluster that contains Hive data (or create one in step 2)
A MaxCompute project in the same region as your Hadoop cluster
Schema support enabled on the MaxCompute project (see step 1)
A VPC network connection from MaxCompute to the EMR cluster VPC (configured in step 3)
Step 1: Enable schema support
Skip this step if schema support is already enabled on your MaxCompute project.
Log on to the MaxCompute console and select a region in the upper-left corner.
In the left navigation pane, choose Manage Configurations > Projects.
On the Projects page, find the target project and click Enable Schema in the Actions column.
Step 2: Prepare Hive data
Skip this step if you already have Hive data in your EMR cluster.
Create an EMR cluster
Log on to the E-MapReduce console and select a region in the upper-left corner.
On the My Clusters page, click Create Cluster and configure the following settings. For a complete guide on creating a cluster, see Create and use a DataLake cluster.
Software configuration
| Parameter | Description |
|---|---|
| Region | Select the same region as your MaxCompute project. The region cannot be changed after the cluster is created. |
| Business Scenario | Select Custom Cluster. |
| Product Version | Select the latest version. |
| High Service Availability | Optional. When enabled, EMR distributes master nodes across different underlying hardware to reduce failure risk. |
| Optional Services (Select One At Least) | Select HADOOP-COMMON (3.2.1), HDFS (3.2.1), HIVE (3.1.3), and YARN (3.2.1). |
| Collect Service Operational Logs | Enabled by default. To change the setting later, go to the Basic Information tab on the cluster details page. |
| Metadata | Select Self-managed RDS. For setup instructions, see Create an RDS for MySQL instance and configure the database. For production clusters, use ApsaraDB RDS for MySQL. Configure the following parameters in hivemetastore-site.xml: javax.jdo.option.ConnectionURL (the JDBC connection string, for example jdbc:mysql://rds.host.name/hive_db_name), javax.jdo.option.ConnectionUserName, and javax.jdo.option.ConnectionPassword. |
Hardware configuration
| Parameter | Description |
|---|---|
| Billing Method | Use Pay-as-you-go for testing. Switch to Subscription for production. |
| Zone | Cannot be changed after the cluster is created. |
| VPC | Select an existing VPC in the region, or click Create VPC. |
| vSwitch | Select a vSwitch in the selected zone, or create one if none is available. |
| Default Security Group | Advanced security groups created on ECS are not supported. Select an existing security group or create a new one. |
| Node Group | Enable Assign Public Network IP for the master node group. Use default values for other parameters. |
| Cluster Scaling | Optional. Configure scaling rules as needed. |
Basic configuration
| Parameter | Description |
|---|---|
| Cluster Name | 1–64 characters. Supports Chinese characters, letters, digits, hyphens (-), and underscores (_). |
| Identity Credentials | Key Pair: SSH key authentication without a password. Password: used to log on to the master node remotely. Record the password for later use. |
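The self-managed RDS metastore parameters listed under Software configuration can be sketched as a hivemetastore-site.xml fragment. The host name, database name, and credentials below are placeholders, not values from this tutorial:

```xml
<configuration>
  <!-- JDBC connection string for the metastore database on ApsaraDB RDS for MySQL. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://rds.host.name/hive_db_name</value>
  </property>
  <!-- Account for the metastore database (placeholder values). -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive_user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>
</configuration>
```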
Get the public IP address of the master node
Log on to the E-MapReduce console and select a region.
Click Nodes for the target cluster.
On the Nodes tab, click the expand icon in the node group row to see the Public IP Address and Node Name/ID.
If the public IP address is empty, click the node ID to open the instance details page. Under Basic Information > Public IP Address, either:
Click Associate EIP to attach an Elastic IP Address (EIP). For details, see Elastic IP Address.
Click Assign Public IP Address, set the bandwidth to greater than 0 Mbps, and let the system assign a static public IP address. For details, see Static public IP address.
Use an EIP to flexibly manage bandwidth and switch resources. Use a static public IP address when you only need a fixed egress that does not change.
Log on to the master node
Log on to the E-MapReduce console and select a region.
Click Nodes for the target cluster.
On the Nodes tab, click the expand icon in the node group row, then click the node ID to open the Instance details page.
Click Connect.
In the Remote connection dialog, keep the default Workbench option and click Sign in now.
Enter the instance credentials in the logon window.
If you used a key pair when creating the cluster, the private key was automatically downloaded as a .pem file. Select Secure Shell (SSH) key authentication and upload the private key file to log on without a password.
Create Hive test tables and insert data
After logging on to the master node, run the following commands in the terminal:
[user@emr-node ~]$ hive
-- Create a database.
hive> CREATE DATABASE IF NOT EXISTS myhive;
-- Switch to the database.
hive> USE myhive;
-- Create a non-partitioned table.
hive> CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
age INT,
department STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- Insert data.
hive> INSERT INTO employees VALUES
(1, 'John', 25, 'Sales'),
(2, 'Jane', 30, 'Marketing'),
(3, 'Mike', 35, 'Engineering'),
(4, 'Sarah', 28, 'HR'),
(5, 'David', 32, 'Finance'),
(6, 'Linda', 29, 'IT'),
(7, 'Robert', 31, 'Operations'),
(8, 'Emily', 27, 'Research'),
(9, 'Michael', 33, 'Development'),
(10, 'Chris', 26, 'Support');
-- Create a partitioned table.
hive> CREATE TABLE employees_pt (
id INT,
name STRING,
age INT
)
PARTITIONED BY (department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Insert data into partitions.
hive> INSERT INTO employees_pt PARTITION (department='Sales') VALUES (1, 'John', 25), (2, 'Jane', 30), (3, 'Mike', 35);
hive> INSERT INTO employees_pt PARTITION (department='Marketing') VALUES (4, 'Sarah', 28), (5, 'David', 32);
hive> INSERT INTO employees_pt PARTITION (department='Engineering') VALUES (6, 'Linda', 29), (7, 'Robert', 31);
hive> INSERT INTO employees_pt PARTITION (department='HR') VALUES (8, 'Emily', 27), (9, 'Michael', 33), (10, 'Chris', 26);
Verify the tables were created correctly:
-- Query the non-partitioned table.
hive> SELECT * FROM employees;
1 John 25 Sales
2 Jane 30 Marketing
3 Mike 35 Engineering
4 Sarah 28 HR
5 David 32 Finance
6 Linda 29 IT
7 Robert 31 Operations
8 Emily 27 Research
9 Michael 33 Development
10 Chris 26 Support
-- Query the partitioned table.
hive> SELECT * FROM employees_pt;
6 Linda 29 Engineering
7 Robert 31 Engineering
8 Emily 27 HR
9 Michael 33 HR
10 Chris 26 HR
4 Sarah 28 Marketing
5 David 32 Marketing
1 John 25 Sales
2 Jane 30 Sales
3 Mike 35 Sales
Step 3: Create a Hive+HDFS external data source
Before configuring the external data source, collect the required values from your EMR cluster.
Collect required values from your EMR cluster
You need three values: the cluster name, the NameNode address, and the HMS service address.
Cluster name (the dfs.nameservices value in HDFS configuration):
Log on to the E-MapReduce console and click the cluster ID.
On the Services tab, click Configure for HDFS.
Switch to the hdfs-site.xml tab and search for dfs.nameservices in the By Name column. The value is your cluster name.
NameNode address (format: <Private IP>:8020):
On the cluster details page, click Status for HDFS on the Services tab.
In the Components area, click the expand icon next to NameNode to expand the topology list.
Note the Private IP Address of the master-1-1 node. The NameNode address is <Private IP>:8020.
HMS service address (format: <Private IP>:9083):
On the cluster details page, click Status for Hive on the Services tab.
In the Components area, click the expand icon next to HiveRuntime to expand the topology list.
Note the Private IP Address of the master-1-1 node. The HMS service address is <Private IP>:9083.
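If you prefer the command line over the console, these values also live in the cluster's client configuration files on the master node (typically hdfs-site.xml and hive-site.xml; the exact paths vary by EMR version). A minimal Python sketch for pulling a property out of a Hadoop-style config file follows; the sample XML is illustrative, not from a real cluster:

```python
import xml.etree.ElementTree as ET

def get_property(xml_text: str, name: str):
    """Return the <value> of the named <property> in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Illustrative hive-site.xml content; on a real cluster, read the file instead.
hive_site = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.0.10:9083</value>
  </property>
</configuration>"""

uri = get_property(hive_site, "hive.metastore.uris")
# Drop the thrift:// scheme to get the <Private IP>:9083 form the console expects.
hms_address = uri.removeprefix("thrift://")
print(hms_address)  # 192.168.0.10:9083
```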
Create the external data source
Log on to the MaxCompute console and select a region.
In the left navigation pane, choose Manage Configurations > External Data Source.
Click Create External Data Source.
Configure the following parameters:
| Parameter | Required | Description |
|---|---|---|
| External Data Source Type | Yes | Select Hive+HDFS. |
| External Data Source Name | Yes | A custom name that starts with a letter and contains only lowercase letters, underscores (_), and digits. Maximum 128 characters. Example: hive_hdfs_mc. |
| Description | No | Optional description. |
| Network Connection Object | Yes | The VPC network connection from MaxCompute to the EMR or Hadoop VPC. The VPC must be in the same region as the MaxCompute project and the external data source. For setup instructions, see the Create a network connection between MaxCompute and the target VPC network section in VPC access solution (direct connection). |
| Cluster Name | Yes | The dfs.nameservices value from the HDFS configuration (collected above). |
| NameNode Address | Yes | The private IP addresses and port of the active and standby NameNode services. The port is usually 8020. Example: 192.168.x.x:8020 (collected above). |
| HMS Service Address | Yes | The private IP address and port of the Hive Metastore Service. The port is usually 9083. Example: 192.168.x.x:9083 (collected above). |
| Authentication Type | Yes | Currently, only No Authentication is supported. |
| Create vSwitch | Yes | MaxCompute uses the reverse access 2.0 solution to access the data source over the VPC. Select an existing vSwitch or create one that meets the zone requirements shown in the console. |
Click OK.
On the External Data Source page, find the new data source and click Details in the Actions column to verify the configuration.
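As an optional sanity check, you can confirm from a machine inside the EMR VPC that the NameNode and HMS endpoints accept TCP connections. This is a generic socket probe, nothing MaxCompute-specific, and the IP address below is an example:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example endpoints; substitute the addresses collected from your cluster.
for service, host, port in [("NameNode", "192.168.0.10", 8020),
                            ("HMS", "192.168.0.10", 9083)]:
    state = "reachable" if port_open(host, port) else "NOT reachable"
    print(f"{service} {host}:{port} is {state}")
```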
Step 4: Create an external schema
Connect to your MaxCompute project and run the following SQL:
-- Enable schema syntax.
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS <YOUR_EXTERNAL_SCHEMA_NAME>
WITH <YOUR_EXTERNAL_DATASOURCE_NAME>
ON 'myhive';
Replace the placeholders with your values:
| Placeholder | Description | Example |
|---|---|---|
| <YOUR_EXTERNAL_SCHEMA_NAME> | Name for the external schema in MaxCompute | ex_hms_hdfs |
| <YOUR_EXTERNAL_DATASOURCE_NAME> | Name of the external data source created in step 3. The MaxCompute project must be in the same region as the data source. | hive_hdfs_mc |
The myhive value is the name of the Hive database created in step 2.
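With the example values from the table above filled in, the full statement would read (shown for illustration; substitute your own names):

```sql
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS ex_hms_hdfs
WITH hive_hdfs_mc
ON 'myhive';
```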
Step 5: Query Hive data from MaxCompute
All queries in this section require the SET odps.namespace.schema=true flag for external schema access.
List tables in the external schema
SET odps.namespace.schema=true;
SHOW TABLES IN <YOUR_EXTERNAL_SCHEMA_NAME>;
-- Expected output:
ALIYUN$xxx:employees
ALIYUN$xxx:employees_pt
OK
If the query fails, check whether the MaxCompute VPC and security group have been added to the RDS whitelist in the RDS console. After adding the security group, wait at least 5 minutes for the whitelist to take effect.
Query a non-partitioned table
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees;
-- Result:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      |
| 2          | Jane       | 30         | Marketing  |
| 3          | Mike       | 35         | Engineering|
| 4          | Sarah      | 28         | HR         |
| 5          | David      | 32         | Finance    |
| 6          | Linda      | 29         | IT         |
| 7          | Robert     | 31         | Operations |
| 8          | Emily      | 27         | Research   |
| 9          | Michael    | 33         | Development|
| 10         | Chris      | 26         | Support    |
+------------+------------+------------+------------+
Query a partitioned table
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees_pt WHERE department='HR';
-- Result:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 8          | Emily      | 27         | HR         |
| 9          | Michael    | 33         | HR         |
| 10         | Chris      | 26         | HR         |
+------------+------------+------------+------------+
Step 6: Query new partitions added to Hive
MaxCompute reflects new partitions added to the Hive cluster without any manual refresh. To verify this behavior:
On the EMR master node, insert a new partition into the Hive table:
INSERT INTO employees_pt PARTITION (department='Computer') VALUES (11, 'Cily', 29), (12, 'Miky', 35);
-- Verify on the Hive side.
hive> SELECT * FROM employees_pt;
-- Result:
OK
11 Cily 29 Computer
12 Miky 35 Computer
6 Linda 29 Engineering
7 Robert 31 Engineering
8 Emily 27 HR
9 Michael 33 HR
10 Chris 26 HR
4 Sarah 28 Marketing
5 David 32 Marketing
1 John 25 Sales
2 Jane 30 Sales
3 Mike 35 Sales
In MaxCompute, query the new partition directly. No additional configuration is needed:
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees_pt WHERE department='Computer';
-- Result:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 11         | Cily       | 29         | Computer   |
| 12         | Miky       | 35         | Computer   |
+------------+------------+------------+------------+
Step 7: Copy Hive data into MaxCompute
Federated queries read data from the Hadoop cluster each time they run. For tables you query frequently, copy the data into MaxCompute for better performance:
-- Copy the Hive table data into a MaxCompute native table.
CREATE TABLE employees_copy AS SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees;
-- Verify the copied data.
SELECT * FROM employees_copy;
-- Result:
+------------+------------+------------+-------------+
| id         | name       | age        | department  |
+------------+------------+------------+-------------+
| 1          | John       | 25         | Sales       |
| 2          | Jane       | 30         | Marketing   |
| 3          | Mike       | 35         | Engineering |
| 4          | Sarah      | 28         | HR          |
| 5          | David      | 32         | Finance     |
| 6          | Linda      | 29         | IT          |
| 7          | Robert     | 31         | Operations  |
| 8          | Emily      | 27         | Research    |
| 9          | Michael    | 33         | Development |
| 10         | Chris      | 26         | Support     |
+------------+------------+------------+-------------+
Keep using the external schema for data that is actively updated in Hive, where you need the latest values without a separate ingestion step.
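If you later need to refresh a copied table, one simple pattern (a sketch in standard MaxCompute SQL; employees_copy is the table created above) is to overwrite it from the external schema:

```sql
-- Re-pull the latest Hive data into the MaxCompute native table.
INSERT OVERWRITE TABLE employees_copy
SELECT * FROM <YOUR_EXTERNAL_SCHEMA_NAME>.employees;
```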
What's next
VPC access solution (direct connection) — Configure the network connection between MaxCompute and your Hadoop VPC
Create and use a DataLake cluster — Set up an EMR cluster for data lake analytics