This document uses the scenario of syncing a single E-MapReduce (EMR) Hive table to MaxCompute to demonstrate best practices for configuring a data source, establishing network connectivity, and setting up a batch synchronization node.
Background
Hive can be used to store, query, and analyze large-scale data in Hadoop. The tool maps structured data files to database tables and provides SQL query capabilities by converting SQL statements into MapReduce jobs.
Prerequisites
You have purchased a serverless resource group.
You have created a Hive data source and a MaxCompute data source. For more information, see Data source configuration.
You have established a network connection between the resource group and the data source. For more information, see Network connectivity solutions.
Note: If you use a public endpoint to connect an exclusive resource group to EMR, you must configure the Security Group rules for the EMR cluster to allow access from the elastic IP (EIP) address of the exclusive resource group. The Security Group's inbound rules must allow access to the required EMR cluster ports, such as 10000, 9093, and 8020.
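For reference, port 10000 is the default HiveServer2 port, so a JDBC-based connection to the cluster typically uses a URL of the following form (the host name and database are hypothetical placeholders, not values from this scenario):

```
jdbc:hive2://emr-header-1.example.com:10000/default
```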
Limitations
Syncing source data to MaxCompute external tables is not supported.
Procedure
Step 1: Create a node and configure the task
For the general steps to create a node and use the Codeless UI, see Configure a task in the Codeless UI.
Step 2: Configure the data source and destination
Configure the data source (Hive)
This section describes the key parameters for configuring the Hive data source, which is a Hive table in this example.
Parameter | Description |
Hive Read Method | The method used to read data from Hive: based on HDFS files or through JDBC. Note: The HDFS method offers higher efficiency. The JDBC method generates a MapReduce program, resulting in lower sync performance. Note that the HDFS method does not support conditional filtering or reading from views. Choose the method that best suits your requirements. |
Table | Select the source Hive table. Ensure that the Table Schema is identical in both the development and production environments for the Hive data source. Note: The table list and Table Schema shown here come from your development environment. If the table definitions differ between your development and production environments, the task might be configured correctly in the development environment but fail in production with errors such as "table not found" or "column not found". |
Parquet schema | If the Hive table is stored in the Parquet format, you must configure the corresponding Parquet schema. |
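For Parquet-backed tables, the schema is written in Parquet's message-type syntax. The following is a minimal sketch; the field names and types are illustrative only and must match your actual Hive table:

```
message hive_schema {
  optional int64 id;
  optional binary name (UTF8);
  optional double amount;
}
```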
Configure the data destination (MaxCompute)
This section describes the key parameters for configuring the MaxCompute data destination.
Parameters not described in the following table can be left at their default values.
Parameter | Description |
Tunnel Resource Group | The MaxCompute Tunnel Quota used for data transfer. By default, 'Public transport resources' is selected, which is the free quota provided by MaxCompute. If your exclusive Tunnel Quota becomes unavailable due to overdue payments or expiration, the task automatically switches to 'Public transport resources' at runtime. |
Table | Select the target MaxCompute table. If you are using a standard DataWorks Workspace, ensure that a MaxCompute table with the same name and a consistent Table Schema exists in both your development and production environments. Alternatively, click Generate Destination Table Schema to have the system automatically create a table to receive the data; you can manually adjust the generated table creation statement. |
Partition Information | If the destination table is partitioned, specify the value for the partition column. |
Write Method | Specifies whether to overwrite or append data in the target table. |
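To make the Table and Partition Information settings concrete, the following is a minimal sketch of a partitioned destination table of the kind that Generate Destination Table Schema can produce. The table name, column names, and types are hypothetical and must be adapted to your source table:

```sql
-- Hypothetical destination table; adjust names, types, and lifecycle as needed.
CREATE TABLE IF NOT EXISTS ods_hive_orders (
    id     BIGINT,
    name   STRING,
    amount DOUBLE
)
PARTITIONED BY (pt STRING);
```

For a table like this, the Partition Information value is commonly set with a scheduling parameter, for example `pt=${bizdate}`, so that each scheduled run writes to the partition for its business date.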
Step 3: Configure and validate the task
Field Mapping: Typically, you can use Map Fields with the Same Name or Map Fields in the Same Line. If the order or names of the fields in the source and destination differ, you can adjust the mappings manually.
Channel Control: Set the Policy for Dirty Data Records to reject any dirty data to ensure data quality. You can initially keep the default values for other parameters.
Step 4: Configure and debug the task
On the right side of the batch synchronization node configuration page, click Run Configuration. Set the Resource Group and Script Parameters for the debug run. Then, click Run in the top toolbar to test if the sync pipeline runs successfully.
In the left-side navigation pane, click the create icon to the right of Personal Directory to create a file with the .sql extension. Run the following SQL statement to query the destination table and verify that the data meets expectations. Note: To query data this way, you must bind the destination MaxCompute project to DataWorks as a computing resource.
On the right side of the .sql file editing page, click Run Configuration. Specify the Type, Computing Resources, and Resource Group, and then click Run in the top toolbar.
SELECT * FROM <your_maxcompute_destination_table> WHERE pt=<your_partition> LIMIT 20;
Step 5: Configure scheduling and publish the task
On the right side of the batch synchronization node, click Scheduling Settings. Configure the parameters for periodic runs as described in Scheduling Configuration. Then, click Publish in the top toolbar and follow the on-screen instructions to publish the task.