This topic describes how to create an external schema in MaxCompute and use it to query Hive table data in an E-MapReduce (EMR) cluster.
Prerequisites
You have created a MaxCompute project and enabled project-level metadata support for schemas for that project (the target project in which you will create the external schema).
Limits
The data lakehouse solution is supported only in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Singapore, and Germany (Frankfurt) regions.
The VPC where MaxCompute is deployed must be in the same region as the VPC where the EMR cluster is deployed.
Step 1: Prepare Hive data
Log on to the EMR on ECS console, create an EMR cluster, and connect to the instance.
When you create an EMR cluster, pay attention to the following configuration items:
All of the following configuration items are in the Software Configuration section.
- Business Scenario: Select a business scenario as needed. Example: Custom cluster.
- Edition: Select an EMR version that is built based on Hadoop and Hive. Example: EMR-3.53.0.
- Optional Services: Select the Hadoop, HDFS, Hive, YARN, and ZooKeeper components. The related service processes are started by default for the selected components. Example: Hadoop-Common, HDFS, Hive, YARN, Spark3, ZooKeeper.
- Metadata: Select Self-managed RDS and configure the database connection parameters. For more information, see Configure self-managed RDS. Note: If your Hive+HDFS external data source is in an EMR cluster, you can use only self-managed RDS as the metadatabase. Data Lake Formation (DLF) unified metadata and built-in MySQL are not supported. Before you read data, make sure that the Hive component in the EMR cluster is running properly. Example: Self-managed RDS.
After the cluster is created, click Nodes in the Actions column of the cluster.
On the Nodes tab, click the ID of the destination node in the emr-master node group to go to the Elastic Compute Service (ECS) console.
Select a tool to connect to the ECS instance. For more information, see Connection methods.
Note: In this example, Workbench is used to connect to the instance. The logon password is the password that you configured when you created the cluster.
Access Hive and prepare data. Log on to the primary node of the cluster by using Workbench. In the terminal, enter hive to start the Hive CLI, and then run the following commands.
Create a non-partitioned table and write data to the table.
-- Create a database
CREATE DATABASE IF NOT EXISTS myhive;

-- Switch to the database
USE myhive;

-- Create a non-partitioned table
CREATE TABLE IF NOT EXISTS employees (
    id INT,
    name STRING,
    age INT,
    department STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Insert data
INSERT INTO employees VALUES
(1, 'John', 25, 'Sales'),
(2, 'Jane', 30, 'Marketing'),
(3, 'Mike', 35, 'Engineering'),
(4, 'Sarah', 28, 'HR'),
(5, 'David', 32, 'Finance'),
(6, 'Linda', 29, 'IT'),
(7, 'Robert', 31, 'Operations'),
(8, 'Emily', 27, 'Research'),
(9, 'Michael', 33, 'Development'),
(10, 'Chris', 26, 'Support');

-- Query the table data
SELECT * FROM employees;

Output:
1       John    25      Sales
2       Jane    30      Marketing
3       Mike    35      Engineering
4       Sarah   28      HR
5       David   32      Finance
6       Linda   29      IT
7       Robert  31      Operations
8       Emily   27      Research
9       Michael 33      Development
10      Chris   26      Support

Create a partitioned table and write data to the table.
-- Create a partitioned table
CREATE TABLE employees_pt (
    id INT,
    name STRING,
    age INT
)
PARTITIONED BY (department STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Insert data
INSERT INTO employees_pt PARTITION (department='Sales') VALUES(1, 'John', 25),(2, 'Jane', 30),(3, 'Mike', 35);
INSERT INTO employees_pt PARTITION (department='Marketing') VALUES(4, 'Sarah', 28),(5, 'David', 32);
INSERT INTO employees_pt PARTITION (department='Engineering') VALUES(6, 'Linda', 29),(7, 'Robert', 31);
INSERT INTO employees_pt PARTITION (department='HR') VALUES(8, 'Emily', 27),(9, 'Michael', 33),(10, 'Chris', 26);

-- Query the table data
SELECT * FROM employees_pt;

Output:
6       Linda   29      Engineering
7       Robert  31      Engineering
8       Emily   27      HR
9       Michael 33      HR
10      Chris   26      HR
4       Sarah   28      Marketing
5       David   32      Marketing
1       John    25      Sales
2       Jane    30      Sales
3       Mike    35      Sales
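If you want to confirm how the rows were distributed across partitions, Hive's SHOW PARTITIONS statement lists the partitions of a table. The following is a minimal check that you can run in the same Hive session; based on the INSERT statements above, it should list the four department partitions.

-- List the partitions of the partitioned table
SHOW PARTITIONS employees_pt;
-- Expected partitions: department=Engineering, department=HR,
-- department=Marketing, department=Sales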
Step 2: Create a Hive+HDFS external data source in MaxCompute
Log on to the MaxCompute console and select a region in the upper-left corner.
In the navigation pane on the left, choose Tenant Management > Foreign Server.
On the Foreign Server page, click Create Foreign Server.
In the Create Foreign Server dialog box, configure the following parameters:
- Foreign Server Type: Select Hive+HDFS. Example: Hive+HDFS.
- Foreign Server Name: You can specify a custom name. The name can contain only lowercase letters, underscores (_), and digits, must start with a lowercase letter, and can be up to 128 characters in length. Example: hive_fs.
- Foreign Server Description: Enter the information as needed. Example: None.
- Network Connection Object: The name of the network connection. You can select an existing connection or establish a new connection between MaxCompute and the VPC of the EMR or Hadoop cluster. For more information about the parameters, see the Create a network connection between MaxCompute and the target VPC section in Access a VPC (direct connection). Note: For more information about the basic concepts of network connections, see Networklink. The VPC must be deployed in the same region as the MaxCompute foreign server and the project to which the foreign server is mounted. Example: networklink.
- Cluster Name: The name of the cluster. For a high availability (HA) Hadoop cluster, this value is the HDFS nameservice name (the value of dfs.nameservices). To obtain the cluster name, log on to the EMR console and click the cluster ID to go to the cluster details page. On the Cluster Services tab, click Configure for the HDFS service to go to the Configuration page. Click the hdfs-site.xml tab and search for dfs.nameservices in the Configuration Item Name column. The value of this configuration item is the cluster name. You can also read this value and the two addresses below from the Hive CLI, as shown in the sketch after this list. Example: hdfs-cluster.
- NameNode Address: The IP addresses and port numbers of the active and standby NameNode processes in the Hadoop cluster. In most cases, the port number is 8020. To obtain the NameNode address, log on to the EMR console and click the cluster ID to go to the cluster details page. On the Cluster Services tab, click HDFS to go to the Status page. In the Component List section, expand the topology list next to NameNode and obtain the internal IP address of the Master-1-1 node. The NameNode address is in the Internal IP:8020 format. Example: 172.22.xx.xxx:8020.
- HMS Service Address: The IP address and port number of the Hive Metastore Service (HMS) in the Hadoop cluster. In most cases, the port number is 9083. The address is in the Internal IP:9083 format. Note: For information about how to obtain the internal IP address, see the description of the NameNode Address parameter. Example: 172.22.xx.xxx:9083.
- Authentication Type: Only No Authentication is supported. Example: No Authentication.
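If you prefer to read these values from the cluster itself instead of the console, the Hive CLI can print them, because the SET statement resolves properties from the loaded Hadoop and Hive configuration files. The following is a convenience sketch, assuming that you run it in the hive CLI on the EMR master node and that the cluster configuration files are on the Hive classpath (the default on EMR); the values in the comments are illustrative.

-- For an HA cluster, fs.defaultFS typically embeds the nameservice, which is the Cluster Name,
-- for example fs.defaultFS=hdfs://hdfs-cluster
SET fs.defaultFS;
-- hive.metastore.uris gives the HMS Service Address,
-- for example hive.metastore.uris=thrift://172.22.xx.xxx:9083
SET hive.metastore.uris;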
Click OK to create the foreign server.
Step 3: Create an external schema in MaxCompute
Run the following commands in the MaxCompute client to create an external schema:
Note: You must enable project-level metadata support for schemas for the target MaxCompute project in advance.
SET odps.namespace.schema=true;
CREATE EXTERNAL SCHEMA IF NOT EXISTS es_hive
WITH hive_fs
ON 'myhive';

Code description:
es_hive: the name of the external schema. You can customize the name.
hive_fs: the name of the foreign server that you created in Step 2. The project to which the external schema belongs must be in the same region as the foreign server.
myhive: the name of the Hive database that you created in Step 1.
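To confirm that the external schema is visible, you can list the schemas in the current project. This is a minimal check, assuming that odps.namespace.schema is still set to true in the session:

-- List the schemas, including external schemas, in the current project
SHOW SCHEMAS;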
Step 4: Use SQL to access Hadoop Hive data in the MaxCompute client
Query the tables in the external schema in the MaxCompute client.
SET odps.namespace.schema=true;
SHOW TABLES IN es_hive;

Output:
ALIYUN$xxx:employees
ALIYUN$xxx:employees_pt

Query the details of the tables in the external schema.
Query the non-partitioned table.
SELECT * FROM es_hive.employees;

Output:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      |
| 2          | Jane       | 30         | Marketing  |
| 3          | Mike       | 35         | Engineering|
| 4          | Sarah      | 28         | HR         |
| 5          | David      | 32         | Finance    |
| 6          | Linda      | 29         | IT         |
| 7          | Robert     | 31         | Operations |
| 8          | Emily      | 27         | Research   |
| 9          | Michael    | 33         | Development|
| 10         | Chris      | 26         | Support    |
+------------+------------+------------+------------+

Query the partitioned table.
SELECT * FROM es_hive.employees_pt WHERE department='HR';

Output:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 8          | Emily      | 27         | HR         |
| 9          | Michael    | 33         | HR         |
| 10         | Chris      | 26         | HR         |
+------------+------------+------------+------------+
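You can also reference the Hive tables by their fully qualified project.schema.table names, which is convenient in scripts that work across projects. In the following sketch, your_project_name is a placeholder for your actual MaxCompute project name:

SET odps.namespace.schema=true;
-- your_project_name is a placeholder; replace it with your MaxCompute project name
SELECT * FROM your_project_name.es_hive.employees WHERE department='Sales';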
Step 5: Add new data to the Hadoop data source
Insert new partition data into the Hive partitioned table.
Log on to the primary node of the cluster by using Workbench. In the terminal, enter hive to start the Hive CLI, and then run the following commands:

-- Switch to the created database and insert data into the employees_pt table
USE myhive;
INSERT INTO employees_pt PARTITION (department='Computer') VALUES(11, 'Cily', 29),(12, 'Miky', 35);

-- Query the partitioned table in Hive
SELECT * FROM employees_pt;

Output:
11      Cily    29      Computer
12      Miky    35      Computer
6       Linda   29      Engineering
7       Robert  31      Engineering
8       Emily   27      HR
9       Michael 33      HR
10      Chris   26      HR
4       Sarah   28      Marketing
5       David   32      Marketing
1       John    25      Sales
2       Jane    30      Sales
3       Mike    35      Sales

Query the new partition data in the MaxCompute client.
SELECT * FROM es_hive.employees_pt WHERE department='Computer';

Output:
+------+------+------+------------+
| id   | name | age  | department |
+------+------+------+------------+
| 11   | Cily | 29   | Computer   |
| 12   | Miky | 35   | Computer   |
+------+------+------+------------+
Step 6: Copy the data of the federated external table to the data warehouse in the MaxCompute client
-- Copy the data of the federated external table to the data warehouse
CREATE TABLE employees_copy AS SELECT * FROM es_hive.employees;
-- Query the copied data in the data warehouse table
SELECT * FROM employees_copy;

Output:
+------------+------------+------------+------------+
| id         | name       | age        | department |
+------------+------------+------------+------------+
| 1          | John       | 25         | Sales      |
| 2          | Jane       | 30         | Marketing  |
| 3          | Mike       | 35         | Engineering|
| 4          | Sarah      | 28         | HR         |
| 5          | David      | 32         | Finance    |
| 6          | Linda      | 29         | IT         |
| 7          | Robert     | 31         | Operations |
| 8          | Emily      | 27         | Research   |
| 9          | Michael    | 33         | Development|
| 10         | Chris      | 26         | Support    |
+------------+------------+------------+------------+
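CREATE TABLE ... AS SELECT does not carry over the partition layout of a partitioned source table. If you also want to copy employees_pt into the data warehouse and keep its partitions, one possible approach is to create the target table with the partition column first and then load it with a dynamic-partition insert. The table name employees_pt_copy and the column types below are illustrative assumptions:

-- Create the target partitioned table (name and types are assumptions for this sketch)
CREATE TABLE IF NOT EXISTS employees_pt_copy (
    id BIGINT,
    name STRING,
    age BIGINT
)
PARTITIONED BY (department STRING);

-- Copy all partitions from the external schema table by using a dynamic-partition insert
INSERT OVERWRITE TABLE employees_pt_copy PARTITION (department)
SELECT id, name, age, department FROM es_hive.employees_pt;

-- Verify one partition of the copy
SELECT * FROM employees_pt_copy WHERE department='HR';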