This topic describes how to create and configure a batch synchronization node to synchronize data in an E-MapReduce (EMR) Hive database to MaxCompute.

Background information

Hive is a Hadoop-based data warehouse tool that is used to extract, transform, and load data. You can use Hive to store, query, and analyze large-scale data that is stored in Hadoop. Hive maps structured data files to database tables, supports SQL queries, and converts SQL statements into MapReduce tasks. Data Integration supports data synchronization between Hive and other types of data sources.

Prepare data sources

Add a MaxCompute data source

When you associate a MaxCompute compute engine with your workspace, the system automatically generates a MaxCompute data source. Alternatively, you can manually add a MaxCompute data source to the workspace. For more information, see Associate a MaxCompute compute engine with a workspace or Add a MaxCompute data source.

Add a Hive data source

In the upper-right corner of the Data source page in the DataWorks console, click Add data source. In the Add data source dialog box, add a Hive data source as prompted. You can add a Hive data source in one of the following modes: Alibaba Cloud instance mode, connection string mode, or built-in mode of CDH.
  • If the Hive data source belongs to Alibaba Cloud EMR, we recommend that you add the data source in Alibaba Cloud instance mode.
  • If the Hive data source does not belong to Alibaba Cloud EMR, we recommend that you add the data source in connection string mode.
  • If your workspace is associated with a CDH compute engine, we recommend that you add the data source in built-in mode of CDH.
For more information, see Add a Hive data source.

Create an exclusive resource group for Data Integration and establish a network connection between the resource group and the EMR Hive data source

Before you synchronize data, you must establish a network connection between your exclusive resource group for Data Integration and the EMR Hive data source. For more information, see Configure network connectivity.

  • If your exclusive resource group for Data Integration and the EMR Hive data source reside in the same region, you can use a virtual private cloud (VPC) that resides in the region to establish a network connection between the resource group and the data source. To establish a network connection, perform the operations in Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group.
  • If your exclusive resource group for Data Integration and the EMR Hive data source reside in different regions, you can establish a network connection between the resource group and the data source over the Internet. In this case, you must configure a security group rule to allow access from the exclusive resource group for Data Integration to the EMR Hive data source. In most cases, ports 10000, 9093, and 8020 of the related EMR cluster need to be enabled. For more information, see Step 2: Configure a security group rule to allow access to the EMR Hive data source.
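
Before you configure the synchronization node, you can verify that the required ports are reachable from a machine in the same network as the resource group with a simple TCP connection test. The following sketch is illustrative; the host name `emr-header-1.cluster` is a placeholder for your EMR cluster's address, and the port list matches the ports mentioned above.

```python
import socket

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Ports that typically need to be open on the EMR cluster:
# 10000 (HiveServer2), 8020 (HDFS NameNode), and 9093, as listed above.
# "emr-header-1.cluster" is a placeholder host name.
for port in (10000, 9093, 8020):
    status = "open" if check_port("emr-header-1.cluster", port) else "unreachable"
    print(f"port {port}: {status}")
```

If a port is reported as unreachable, review the security group rules and the vSwitch configuration described in the steps below.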

Step 1: Associate the exclusive resource group for Data Integration with a VPC and add a custom route for the resource group

Note If you establish a network connection between the exclusive resource group for Data Integration and a data source over the Internet, you can skip this step.
  1. Associate the exclusive resource group for Data Integration with a VPC.
    1. Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
    2. On the VPC Binding tab of the page that appears, click Add Binding. In the Add VPC Binding panel, configure the parameters to associate the resource group with a VPC.
      • VPC: Select the VPC in which the EMR Hive data source resides.
      • Zone and VSwitch: We recommend that you select the zone and vSwitch in which the EMR Hive data source resides. If that zone is not displayed in the drop-down list, you can select any other zone and vSwitch, provided that the vSwitch that you select can connect to the data source.
      • Security Groups: Select a security group to which your EMR cluster belongs. The selected security group must meet the following requirements:
        • The related ports of the EMR cluster are enabled in an inbound rule of the security group. In most cases, the ports that need to be enabled include ports 10000, 9093, and 8020. You can view the information about the security group on the Security Groups page in the ECS console.
        • The CIDR block of the vSwitch with which the exclusive resource group for Data Integration is associated is specified as an authorization object in the security group rule.
  2. Add a custom route for the exclusive resource group for Data Integration.
    Note If you select the zone and vSwitch in which the data source resides in the preceding substep, you can skip this substep. If you select a different zone or vSwitch, you must perform the operations in this substep to add a custom route for the exclusive resource group for Data Integration.
    1. Go to the Resource Groups page in the DataWorks console, find the exclusive resource group for Data Integration that you want to use, and then click Network Settings in the Actions column.
    2. On the VPC Binding tab of the page that appears, find the VPC association record and click Custom Route in the Actions column.
    3. In the Custom Route panel, click Add Route. In the Add Route dialog box, configure the parameters to add a custom route for the exclusive resource group for Data Integration.
      • Destination VPC: Select the region and VPC in which the EMR Hive data source resides.
      • Destination VSwitch: Select the vSwitch in which the EMR Hive data source resides.

Step 2: Configure a security group rule to allow access to the EMR Hive data source

If you establish a network connection between the exclusive resource group for Data Integration and the EMR Hive data source over the Internet, you must configure an inbound security group rule to allow access from the elastic IP address (EIP) of the resource group to the EMR Hive data source. In most cases, ports 10000, 9093, and 8020 of the related EMR cluster need to be enabled.

  1. To view the EIP of the exclusive resource group for Data Integration, find the resource group on the Exclusive Resource Groups tab of the Resource Groups page and click View Information in the Actions column.
  2. In the EMR console, find the desired node of your EMR cluster on the Nodes tab and click the ID of the node. On the page that appears, click the ID of the security group of the node. On the page that appears, add the obtained EIP to the security group.

Create a batch synchronization node

  1. Create a workflow. For more information, see Create a workflow.
  2. Create a batch synchronization node.
    You can use one of the following methods to create a batch synchronization node:
    • Method 1: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization node and click DataStudio in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and click its name. Right-click Data Integration and choose Create Node > Offline synchronization.
    • Method 2: Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a batch synchronization node and click DataStudio in the Actions column. In the Scheduled Workflow pane of the DataStudio page, find the created workflow and double-click its name. In the Data Integration section of the workflow configuration tab, click Offline synchronization.
  3. In the Create Node dialog box, configure the parameters to create a batch synchronization node.

Establish network connections

Select the source, destination, and exclusive resource group for Data Integration, and then establish network connections between the resource group and the data sources.
Note If a network connection cannot be established, you can configure the network connectivity as prompted or by referring to the related topic. For more information, see Establish a network connection between a resource group and a data source.

Configure the source and destination

Configure the source

Configure parameters related to the source. In this example, a batch synchronization node is created to synchronize data in an EMR Hive database to MaxCompute. The following list describes the key parameters.
  • Data Source: The Hive data source that you added.
  • Table: Select the Hive table whose data you want to synchronize. The table in the production environment and the table in the development environment must have the same schema.
    Note The tables and table schemas of the Hive data source in the development environment are displayed. If the table that you select has a schema different from that of the same-named table in the production environment, an error indicating that the table or a column does not exist occurs after you commit the node to the production environment.
  • Method to Read Data of Hive Data Source: Select one of the following methods:
    • Read Data based on HDFS Files: Hive Reader connects to the Hive metastore to obtain the storage path, format, and column delimiter of the HDFS files that correspond to your Hive table, and then reads data directly from the HDFS files.
    • Read Data by Using Hive JDBC: Hive Reader connects to HiveServer2 over Hive Java Database Connectivity (JDBC) and reads data. This method allows you to specify a WHERE clause to filter data and execute SQL statements to read data.
    Note Reading based on HDFS files is more efficient. Reading over Hive JDBC generates MapReduce programs, which lowers synchronization efficiency. However, the HDFS-based method cannot filter data or read views. Select a method based on your business requirements.
  • parquet schema: If the Hive table is stored in the Parquet format, you must configure the parquet schema parameter.
Retain the default values for other parameters.
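
The reason WHERE filters work only in Hive JDBC mode is that this mode essentially executes a SELECT statement against HiveServer2, whereas HDFS-file mode reads raw files. The helper below is a hypothetical sketch of how such a statement can be composed; the function and table names are illustrative and not part of DataWorks.

```python
def build_hive_query(table, columns=None, where=None):
    """Compose the SELECT statement that a JDBC-based reader would run."""
    cols = ", ".join(columns) if columns else "*"
    query = f"SELECT {cols} FROM {table}"
    if where:
        query += f" WHERE {where}"
    return query

# Example: read only one day's data, which HDFS-file mode cannot filter.
print(build_hive_query("ods_log", ["uid", "event"], where="ds = '20220101'"))
# -> SELECT uid, event FROM ods_log WHERE ds = '20220101'
```

A Hive client would then execute such a statement against HiveServer2, which typically listens on port 10000.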

Configure the destination

Configure parameters related to the destination. In this example, a batch synchronization node is created to synchronize data in an EMR Hive database to MaxCompute. The following list describes the key parameters.
  • Data Source: The MaxCompute data source that you added, or the MaxCompute data source that the system automatically added, such as odps_first.
  • Table: The name of the MaxCompute table to which you want to write data.
  • Partition information: If the destination table is a partitioned table, set this parameter to a value for the partition key column. The value can be a constant such as ds=20220101 or a scheduling parameter such as ds=${bizdate}. If you specify a scheduling parameter, the system replaces it with an actual value when the batch synchronization node runs.
    Note Whether and how many times the Partition Key Column parameter is displayed depends on the destination MaxCompute table. For a non-partitioned table, the parameter is not displayed and you do not need to configure it. For a partitioned table, the parameter is displayed once for each partition key column, labeled with that column's name.
Retain the default values for other parameters.
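
To illustrate how a value such as ds=${bizdate} is resolved at run time: the scheduler replaces ${bizdate} with the business date, which is typically the day before the run date, in yyyymmdd format. The sketch below models that substitution; the function is illustrative, not the actual DataWorks implementation.

```python
import re
from datetime import date, timedelta

def resolve_partition(spec: str, run_date: date) -> str:
    """Replace ${bizdate} in a partition spec with the business date
    (the day before run_date), formatted as yyyymmdd."""
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return re.sub(r"\$\{bizdate\}", bizdate, spec)

print(resolve_partition("ds=${bizdate}", date(2022, 1, 2)))  # -> ds=20220101
```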

Configure field mappings

After you configure the source and destination, you must configure the mappings between the fields in the source and the fields in the destination. You can click Map Fields with the Same Name, Map Fields in the Same Line, Delete All Mappings, or Auto Layout to perform the related operation.
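
The two automatic mapping strategies can be pictured as follows. This is an illustrative sketch, not DataWorks code: Map Fields with the Same Name matches fields by name, while Map Fields in the Same Line matches them purely by position.

```python
def map_by_name(src_fields, dst_fields):
    """Pair source and destination fields that share a name."""
    dst = set(dst_fields)
    return [(f, f) for f in src_fields if f in dst]

def map_by_position(src_fields, dst_fields):
    """Pair fields line by line, i.e. by position."""
    return list(zip(src_fields, dst_fields))

src = ["id", "name", "ts"]
dst = ["id", "user_name", "ts"]
print(map_by_name(src, dst))      # -> [('id', 'id'), ('ts', 'ts')]
print(map_by_position(src, dst))  # -> [('id', 'id'), ('name', 'user_name'), ('ts', 'ts')]
```

As the example shows, name-based mapping silently skips fields whose names differ, so check the result when source and destination columns are named differently.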

Configure channel control policies

You can configure settings, such as the maximum number of parallel threads and the maximum number of dirty data records that are allowed.
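
Conceptually, the dirty-data limit works like the sketch below: records that fail conversion are counted, and the job aborts once the count exceeds the allowed maximum. This is an illustration of the policy only, not the actual Data Integration implementation.

```python
def copy_with_dirty_limit(rows, convert, max_dirty):
    """Copy rows through convert(); tolerate at most max_dirty failures."""
    out, dirty = [], 0
    for row in rows:
        try:
            out.append(convert(row))
        except ValueError:
            dirty += 1  # count the dirty record instead of writing it
            if dirty > max_dirty:
                raise RuntimeError(f"aborted: more than {max_dirty} dirty records")
    return out

# Tolerate one bad record; a second failure would abort the job.
print(copy_with_dirty_limit(["1", "x", "3"], int, max_dirty=1))  # -> [1, 3]
```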

Configure scheduling properties

Click Next. In the Configure Scheduling Settings step, configure the scheduling properties that are required to run the synchronization node. For more information about scheduling parameters, see Description for using scheduling parameters in data synchronization.
  • Configure the rerun property.

    You can configure a rerun policy based on your business requirements. Rerunning a node can prevent failures that are caused by transient issues such as network jitter.

  • Configure scheduling dependencies.

    You can configure scheduling dependencies for the node based on your business requirements. You can configure the instance that is generated for the batch synchronization node in the current cycle to depend on the instance that is generated for the same node in the previous cycle. This way, you can ensure that instances generated for the node in different scheduling cycles can finish running in sequence and prevent multiple instances from running at the same time.

Note DataWorks uses resource groups for scheduling to issue batch synchronization nodes in Data Integration to resource groups for Data Integration. Then, DataWorks uses the resource groups for Data Integration to run the nodes. You are charged when you use the resource groups for scheduling to schedule batch synchronization nodes. For more information about the node issuing mechanism, see Mechanism for issuing nodes.

Test the batch synchronization node and commit and deploy the node for running

Test the batch synchronization node

In the top toolbar of the configuration tab of the batch synchronization node, click Run or Run with Parameters to test the batch synchronization node and check whether it can run as expected. Run with Parameters also lets you test whether the scheduling parameters that you configured for the node are replaced as expected.

Commit and deploy the batch synchronization node

If the batch synchronization node runs as expected, click the Save icon to save the node configurations, and then click the Commit icon to commit the node. If the workspace is in standard mode, you must also click the Deploy icon to deploy the node to the production environment. After deployment, the node periodically writes data from Hive to the MaxCompute table based on its schedule, such as by day, hour, or minute. For more information, see Deploy nodes.

After you deploy the batch synchronization node, you can view the running results of the node in Operation Center and perform operations on the node, such as data backfilling. For more information, see Perform basic O&M operations on auto triggered nodes.