This topic describes how to create an external schema in MaxCompute and use it to query Hive table data in an E-MapReduce (EMR) cluster.
Prerequisites
You have created a MaxCompute project and enabled project-level metadata support for schemas for that project (the target project in which you will create the external schema).
Limits
The data lakehouse solution is supported only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
The VPC where MaxCompute is deployed must be in the same region as the VPC where the EMR cluster is deployed.
Step 1: Prepare Hive data
Log on to the EMR on ECS console, create an EMR cluster, and connect to the instance.
When you create an EMR cluster, pay attention to the following configuration items:
All of the following configuration items are in the Software Configuration section.
- Business Scenario: Select a business scenario as needed. Example: Custom cluster.
- Edition: Select an EMR version that is built based on Hadoop and Hive. Example: EMR-3.53.0.
- Optional Services: Select the Hadoop, HDFS, Hive, YARN, and ZooKeeper components. The related service processes are started by default for the selected components. Example: Hadoop-Common, HDFS, Hive, YARN, Spark3, ZooKeeper.
- Metadata: Select Self-managed RDS and configure the database connection parameters. For more information, see Configure self-managed RDS. Note: If your Hive+HDFS external data source is in an EMR cluster, you can use only self-managed RDS as the metadatabase. Data Lake Formation (DLF) unified metadata and built-in MySQL are not supported. Before you read data, make sure that the Hive component in the EMR cluster is running properly. Example: Self-managed RDS.
After the cluster is created, click Nodes in the Actions column of the cluster.
On the Nodes tab, click the ID of the destination node in the emr-master node group to go to the Elastic Compute Service (ECS) console.
Select a tool to connect to the ECS instance. For more information, see Connection methods.
Note: In this example, Workbench is used to connect to the instance. The logon password is the password that you configured when you created the cluster.
Access Hive and prepare data. Log on to the primary node of the cluster by using Workbench. In the terminal, enter hive to start the Hive CLI, and then run the following commands.
Create a non-partitioned table and write data to the table.
-- Create a database
CREATE DATABASE IF NOT EXISTS myhive;

-- Switch to the database
USE myhive;

-- Create a non-partitioned table
CREATE TABLE IF NOT EXISTS employees (
    id INT,
    name STRING,
    age INT,
    department STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Insert data
INSERT INTO employees VALUES
(1, 'John', 25, 'Sales'),
(2, 'Jane', 30, 'Marketing'),
(3, 'Mike', 35, 'Engineering'),
(4, 'Sarah', 28, 'HR'),
(5, 'David', 32, 'Finance'),
(6, 'Linda', 29, 'IT'),
(7, 'Robert', 31, 'Operations'),
(8, 'Emily', 27, 'Research'),
(9, 'Michael', 33, 'Development'),
(10, 'Chris', 26, 'Support');

-- Query the table data
SELECT * FROM employees;

Output:
1       John    25      Sales
2       Jane    30      Marketing
3       Mike    35      Engineering
4       Sarah   28      HR
5       David   32      Finance
6       Linda   29      IT
7       Robert  31      Operations
8       Emily   27      Research
9       Michael 33      Development
10      Chris   26      Support

Create a partitioned table and write data to the table.
-- Create a partitioned table
CREATE TABLE employees_pt (
    id INT,
    name STRING,
    age INT
)
PARTITIONED BY (department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Insert data
INSERT INTO employees_pt PARTITION (department='Sales') VALUES(1, 'John', 25),(2, 'Jane', 30),(3, 'Mike', 35);
INSERT INTO employees_pt PARTITION (department='Marketing') VALUES(4, 'Sarah', 28),(5, 'David', 32);
INSERT INTO employees_pt PARTITION (department='Engineering') VALUES(6, 'Linda', 29),(7, 'Robert', 31);
INSERT INTO employees_pt PARTITION (department='HR') VALUES(8, 'Emily', 27),(9, 'Michael', 33),(10, 'Chris', 26);

-- Query the table data
SELECT * FROM employees_pt;

Output:
6       Linda   29      Engineering
7       Robert  31      Engineering
8       Emily   27      HR
9       Michael 33      HR
10      Chris   26      HR
4       Sarah   28      Marketing
5       David   32      Marketing
1       John    25      Sales
2       Jane    30      Sales
3       Mike    35      Sales
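If you want to confirm how the rows were distributed across partitions, Hive's SHOW PARTITIONS statement lists the partitions of a table. The following is a minimal check that you can run in the same Hive session; based on the INSERT statements above, it should list the four department partitions.

-- List the partitions of the partitioned table
SHOW PARTITIONS employees_pt;
-- Expected partitions: department=Engineering, department=HR,
-- department=Marketing, department=Sales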
Step 2: Create a Hive+HDFS external data source in MaxCompute
Log on to the MaxCompute console and select a region in the upper-left corner.
In the navigation pane on the left, choose Tenant Management > Foreign Server.
On the Foreign Server page, click Create Foreign Server.
In the Create Foreign Server dialog box, configure the following parameters:
- Foreign Server Type: Select Hive+HDFS. Example: Hive+HDFS.
- Foreign Server Name: You can specify a custom name. The name can contain only lowercase letters, underscores (_), and digits, must start with a lowercase letter, and can be up to 128 characters in length. Example: hive_fs.
- Foreign Server Description: Enter the information as needed. Example: None.
- Network Connection Object: The name of the network connection. You can select an existing connection or establish a new connection between MaxCompute and the VPC of the EMR or Hadoop cluster. For more information about the parameters, see the Create a network connection between MaxCompute and the target VPC section in Access a VPC (direct connection). Note: For more information about the basic concepts of network connections, see Networklink. The VPC must be deployed in the same region as the MaxCompute foreign server and the project to which the foreign server is mounted. Example: networklink.
- Cluster Name: The name of the cluster. For a high availability (HA) Hadoop cluster, this value is the HDFS nameservice name (the value of dfs.nameservices). To obtain the cluster name, log on to the EMR console and click the cluster ID to go to the cluster details page. On the Cluster Services tab, click Configure for the HDFS service to go to the Configuration page. Click the hdfs-site.xml tab and search for dfs.nameservices in the Configuration Item Name column. The value of this configuration item is the cluster name. You can also read this value and the two addresses below from the Hive CLI, as shown in the sketch after this list. Example: hdfs-cluster.
- NameNode Address: The IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. To obtain the NameNode address, log on to the EMR console and click the cluster ID to go to the cluster details page. On the Cluster Services tab, click HDFS to go to the Status page. In the Component List section, expand the topology list next to NameNode and obtain the internal IP address of the Master-1-1 node. The NameNode address is in the Internal IP:8020 format. Example: 172.22.xx.xxx:8020.
- HMS Service Address: The IP address and port number of the Hive Metastore Service (HMS) in the Hadoop cluster. In most cases, the port number is 9083. The address is in the Internal IP:9083 format. Note: For information about how to obtain the internal IP address, see the description of the NameNode Address parameter. Example: 172.22.xx.xxx:9083.
- Authentication Type: Only No Authentication is supported. Example: No Authentication.
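If you prefer to read these values from the cluster itself instead of the console, the Hive CLI can print them, because the SET statement resolves properties from the loaded Hadoop and Hive configuration files. The following is a convenience sketch, assuming that you run it in the hive CLI on the EMR master node and that the cluster configuration files are on the Hive classpath (the default on EMR); the values in the comments are illustrative.

-- For an HA cluster, fs.defaultFS typically embeds the nameservice, which is the Cluster Name,
-- for example fs.defaultFS=hdfs://hdfs-cluster
SET fs.defaultFS;
-- hive.metastore.uris gives the HMS Service Address,
-- for example hive.metastore.uris=thrift://172.22.xx.xxx:9083
SET hive.metastore.uris;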
Click OK to create the foreign server.
Step 3: Create an external schema in MaxCompute
Run the following commands in the MaxCompute client to create an external schema:
Note: You must enable project-level metadata support for schemas for the target MaxCompute project in advance.
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS es_hive
WITH hive_fs
ON 'myhive';

Code description:
es_hive: the name of the external schema. You can customize the name.
hive_fs: the name of the foreign server that you created in Step 2. The project to which the external schema belongs must be in the same region as the foreign server.
myhive: the name of the Hive database that you created in Step 1.
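To confirm that the external schema is visible, you can list the schemas in the current project. This is a minimal check, assuming that odps.namespace.schema is still set to true in the session:

-- List the schemas, including external schemas, in the current project
SHOW SCHEMAS;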
Step 4: Use SQL to access Hadoop Hive data in the MaxCompute client
Query the tables in the external schema in the MaxCompute client.
SET odps.namespace.schema=true;
SHOW TABLES IN es_hive;

Output:
ALIYUN$xxx:employees
ALIYUN$xxx:employees_pt

Query the details of the tables in the external schema.
Query the non-partitioned table.
SELECT * FROM es_hive.employees;

Output:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      |
| 2          | Jane       | 30         | Marketing  |
| 3          | Mike       | 35         | Engineering|
| 4          | Sarah      | 28         | HR         |
| 5          | David      | 32         | Finance    |
| 6          | Linda      | 29         | IT         |
| 7          | Robert     | 31         | Operations |
| 8          | Emily      | 27         | Research   |
| 9          | Michael    | 33         | Development|
| 10         | Chris      | 26         | Support    |
+------------+------------+------------+------------+

Query the partitioned table.
SELECT * FROM es_hive.employees_pt WHERE department='HR';

Output:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 8          | Emily      | 27         | HR         |
| 9          | Michael    | 33         | HR         |
| 10         | Chris      | 26         | HR         |
+------------+------------+------------+------------+
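You can also reference the Hive tables by their fully qualified project.schema.table names, which is convenient in scripts that work across projects. In the following sketch, your_project_name is a placeholder for your actual MaxCompute project name:

SET odps.namespace.schema=true;
-- your_project_name is a placeholder; replace it with your MaxCompute project name
SELECT * FROM your_project_name.es_hive.employees WHERE department='Sales';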
Step 5: Add new data to the Hadoop data source
Insert new partition data into the Hive partitioned table.
Log on to the primary node of the cluster by using Workbench. In the terminal, enter hive to start the Hive CLI, and then run the following commands:

-- Switch to the created database and insert data into the employees_pt table
USE myhive;
INSERT INTO employees_pt PARTITION (department='Computer') VALUES(11, 'Cily', 29),(12, 'Miky', 35);

-- Query the partitioned table in Hive
SELECT * FROM employees_pt;

Output:
11      Cily    29      Computer
12      Miky    35      Computer
6       Linda   29      Engineering
7       Robert  31      Engineering
8       Emily   27      HR
9       Michael 33      HR
10      Chris   26      HR
4       Sarah   28      Marketing
5       David   32      Marketing
1       John    25      Sales
2       Jane    30      Sales
3       Mike    35      Sales

Query the new partition data in the MaxCompute client.
SELECT * FROM es_hive.employees_pt WHERE department='Computer';

Output:
+------+------+------+------------+
| id   | name | age  | department |
+------+------+------+------------+
| 11   | Cily | 29   | Computer   |
| 12   | Miky | 35   | Computer   |
+------+------+------+------------+
Step 6: Copy the data of the federated external table to the data warehouse in the MaxCompute client
-- Copy the data of the federated external table to the data warehouse
CREATE TABLE employees_copy AS SELECT * FROM es_hive.employees;
-- Query the copied data in the data warehouse table
SELECT * FROM employees_copy;

Output:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      |
| 2          | Jane       | 30         | Marketing  |
| 3          | Mike       | 35         | Engineering|
| 4          | Sarah      | 28         | HR         |
| 5          | David      | 32         | Finance    |
| 6          | Linda      | 29         | IT         |
| 7          | Robert     | 31         | Operations |
| 8          | Emily      | 27         | Research   |
| 9          | Michael    | 33         | Development|
| 10         | Chris      | 26         | Support    |
+------------+------------+------------+------------+
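CREATE TABLE ... AS SELECT does not carry over the partition layout of a partitioned source table. If you also want to copy employees_pt into the data warehouse and keep its partitions, one possible approach is to create the target table with the partition column first and then load it with a dynamic-partition insert. The table name employees_pt_copy and the column types below are illustrative assumptions:

-- Create the target partitioned table (name and types are assumptions for this sketch)
CREATE TABLE IF NOT EXISTS employees_pt_copy (
    id BIGINT,
    name STRING,
    age BIGINT
)
PARTITIONED BY (department STRING);

-- Copy all partitions from the external schema table by using a dynamic-partition insert
INSERT OVERWRITE TABLE employees_pt_copy PARTITION (department)
SELECT id, name, age, department FROM es_hive.employees_pt;

-- Verify one partition of the copy
SELECT * FROM employees_pt_copy WHERE department='HR';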