ApsaraDB for HBase: Use DataWorks or DataX to import data

Last Updated: Apr 10, 2024

ApsaraDB for HBase Performance-enhanced Edition provides Lindorm Tunnel Service (LTS) for data migration and real-time data synchronization between ApsaraDB for HBase clusters of different versions. LTS can also synchronize real-time data from ApsaraDB RDS or LogHub to ApsaraDB for HBase. DataX is an offline data synchronization tool that is widely used within Alibaba Group. You can use DataX to efficiently synchronize data between heterogeneous data sources, including MySQL, Oracle, SQL Server, PostgreSQL, Hadoop Distributed File System (HDFS), Hive, AnalyticDB for MySQL, HBase, Tablestore, MaxCompute, and Distributed Relational Database Service (DRDS).

Prerequisites

  • The version of your ApsaraDB for HBase cluster is 2.4.3 or later. For more information about how to view or update the cluster version, see Minor version updates.

  • The IP address of a client is added to the whitelist of the ApsaraDB for HBase cluster. For more information, see Configure a whitelist and a security group.
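
Whitelist entries can be individual IP addresses or CIDR blocks. Before you edit the whitelist, it can help to check locally whether a planned set of entries covers a given client. The following is a minimal sketch that uses only the Python standard library. The function name is_whitelisted and the sample addresses are illustrative assumptions, and the check does not query the actual whitelist of the cluster.

```python
import ipaddress
from typing import Iterable


def is_whitelisted(client_ip: str, whitelist: Iterable[str]) -> bool:
    """Return True if client_ip is covered by any whitelist entry.

    Each entry may be a single IP address ("172.16.0.8") or a CIDR
    block ("192.168.1.0/24"). This is a local sanity check only.
    """
    ip = ipaddress.ip_address(client_ip)
    # strict=False lets a bare address be treated as a one-address network.
    return any(
        ip in ipaddress.ip_network(entry, strict=False) for entry in whitelist
    )
```

For example, is_whitelisted("192.168.1.25", ["192.168.1.0/24"]) evaluates to True because the address falls inside the CIDR block.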

Usage notes

  • An ApsaraDB for HBase cluster can be accessed only over a virtual private cloud (VPC).

    Important

    If you want to access an ApsaraDB for HBase cluster over the Internet, update the SDK before you perform the data import operation. For more information, see Use ApsaraDB for HBase SDK for Java to replace an open source HBase version with an ApsaraDB for HBase version.

  • If applications are deployed on an Elastic Compute Service (ECS) instance and you want to access an ApsaraDB for HBase cluster over a VPC, make sure that the ApsaraDB for HBase cluster and the ECS instance meet the following requirements to ensure network connectivity:

    • The ApsaraDB for HBase cluster and the ECS instance are deployed in the same region. We recommend that you deploy the cluster and the instance in the same zone to reduce network latency.

    • The ApsaraDB for HBase cluster and the ECS instance belong to the same VPC.
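
To confirm network connectivity from the ECS instance before you configure a synchronization task, you can try to open a TCP connection to the cluster endpoint. The following is a minimal sketch, not an official tool. The function name can_reach is an assumption, and a successful connection shows only that the port is reachable, not that authentication will succeed.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Call the function with the host and port of the endpoint that is displayed in the ApsaraDB for HBase console, for example can_reach(host, 30020) if your endpoint uses port 30020 as in the samples in this topic.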

Use DataX to synchronize data

You can use DataX to synchronize data by using one of the following methods:

  • Use the Data Integration service provided by Alibaba Cloud DataWorks to configure synchronization tasks in DataX. We recommend that you use this method.

  • Use the open source DataX to configure synchronization tasks.

Use DataWorks to configure a synchronization task in DataX

  1. Create a workspace. For more information, see Create a workspace.

  2. Create a resource group. The following table describes different types of resource groups. We recommend that you use the resources from an exclusive resource group or a custom resource group to access an ApsaraDB for HBase cluster to synchronize data.

    | Resource group type | Reference | Description | Remarks |
    | --- | --- | --- | --- |
    | Exclusive resource group | Exclusive resource group mode | DataWorks automatically subscribes to and manages resources in an exclusive resource group. This ensures high service performance and availability. | Exclusive resources cannot be shared across regions. For example, the exclusive resources in the China (Shanghai) region can be used by the workspaces only in the China (Shanghai) region. The resources cannot be allocated to VPCs in other regions. You can use resources from an exclusive resource group to access ApsaraDB for HBase clusters that are attached to the same vSwitch. |
    | Custom resource group | Create and use a custom resource group for Data Integration | Only DataWorks Enterprise Edition and more advanced editions support custom resource groups. You need to purchase ECS instances for a custom resource group based on your business requirements. You can deploy ECS instances in the VPC in which the ApsaraDB for HBase cluster resides to access the ApsaraDB for HBase cluster over the VPC. If the ECS instances and the ApsaraDB for HBase cluster are not deployed in the same VPC, you can access the ApsaraDB for HBase cluster only over the Internet. | You have all permissions on instances that belong to a custom resource group. If you want to use DataX to log on to or manage the ECS instances, you must install, manage, maintain, and update DataX based on your business requirements. |
    | Default resource group | None | You can use ECS instances that belong to the default resource group to access the ApsaraDB for HBase cluster only over the Internet. | If you use DataWorks to access the ApsaraDB for HBase cluster over the Internet, excess costs are incurred. |

  3. Configure the network for the exclusive resource group, custom resource group, or default resource group.

    • Configure the network for the exclusive resource group.

      1. Associate the exclusive resource group with the VPC in which the ApsaraDB for HBase cluster is deployed. For more information, see Exclusive resource group mode.

      2. Log on to the VPC console and check the IPv4 CIDR block of the VPC with which the exclusive resource group is associated and the IPv4 CIDR block of the vSwitch. Add these IPv4 CIDR blocks to the whitelist of the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configure a whitelist.

    • Configure the network for the custom resource group.

      You need to purchase ECS instances for the custom resource group, check the IP address of each ECS instance, and add the IP addresses of the ECS instances to the whitelist of the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configure a whitelist.

    • Configure the network for the default resource group.

      View the CIDR blocks of the instances in the default resource group. For more information, see the "Add the IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides to an IP address whitelist of a data source" section of the "Configure an IP address whitelist" topic. Then, add the CIDR blocks to the whitelist of the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configure a whitelist.

  4. Create a synchronization task and associate the instances with the resource group.

    1. Create a synchronization task. For more information, see Configure a batch synchronization task by using the codeless UI.

    2. Modify the configurations of HBase Writer and HBase Reader. The HBase Reader plug-in is used to read data from ApsaraDB for HBase clusters. The HBase Writer plug-in is used to write data to ApsaraDB for HBase clusters. For more information, see HBase Writer and HBase Reader.

      When you configure hbaseConfig for an ApsaraDB for HBase Performance-enhanced Edition cluster, you must set the hbase.client.endpoint parameter instead of the hbase.zookeeper.quorum parameter. The following sample code shows how to modify the configuration:

      "hbaseConfig": {
        "hbase.client.connection.impl" : "com.alibaba.hbase.client.AliHBaseUEConnection",
        "hbase.client.endpoint" : "host:30020",
        "hbase.client.username" : "testuser",
        "hbase.client.password" : "password"
      }
      Note
      • The value of the hbase.client.connection.impl parameter is fixed.

      • The hbase.client.endpoint parameter specifies the endpoint of the API for Java provided in the ApsaraDB for HBase console. You can use the endpoint to access an ApsaraDB for HBase Performance-enhanced Edition cluster. For more information about how to obtain the endpoint, see Access a cluster.

      • The hbase.client.username and hbase.client.password parameters specify the user-defined username and password. The default username and the default password are both root. Make sure that the account has read and write permissions on the tables that are stored in the ApsaraDB for HBase Performance-enhanced Edition cluster. The default root account has read and write permissions on all tables in the cluster.

      • Use ApsaraDB for HBase 1.1.X.
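
The hbaseConfig fragment described in this section can also be generated by a script, which makes it harder to mistype the fixed connection class or to leave out the port. The following is a minimal sketch under stated assumptions: the helper name build_hbase_config is hypothetical, the endpoint host:30020 is the placeholder from the sample above, and the default credentials follow the note above.

```python
import json

# Fixed value, as stated in the note for hbase.client.connection.impl.
CONNECTION_IMPL = "com.alibaba.hbase.client.AliHBaseUEConnection"


def build_hbase_config(endpoint, username="root", password="root"):
    """Return the hbaseConfig dict for a Performance-enhanced Edition cluster.

    endpoint must have the form "host:port". It is the Java API endpoint
    shown in the console, not a ZooKeeper quorum.
    """
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {endpoint!r}")
    return {
        "hbase.client.connection.impl": CONNECTION_IMPL,
        "hbase.client.endpoint": endpoint,
        "hbase.client.username": username,
        "hbase.client.password": password,
    }


# Placeholder values taken from the sample configuration above.
config = build_hbase_config("host:30020", "testuser", "password")
print(json.dumps({"hbaseConfig": config}, indent=2))
```

The printed object contains the hbaseConfig key from the sample, so you can copy its contents into the HBase Reader or HBase Writer configuration.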

Use the open source DataX to configure a synchronization task

  1. Download the DataX installation package from the official website of DataX and decompress it.

  2. Configure the hbase11xreader and hbase11xwriter plug-ins.

    In DataX, the hbase11xreader plug-in is used to read data from the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information about how to configure the plug-in, see Configuration examples. The hbase11xwriter plug-in is used to write data to the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information about how to configure the plug-in, see Configuration examples. Sample code:

    ...
    "hbaseConfig": {
      // The endpoint (VPC endpoint) of the cluster.
      "hbase.zookeeper.quorum": "ld-bp150tns0sjxs****-proxy-hbaseue.hbaseue.rds.aliyuncs.com:30020"
    }
    ...
                        
  3. Use DataX to migrate data. For more information about how to use DataX, see Official DataX documentation.
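
The steps above can be sketched as a small script that builds a complete job file for the open source DataX. This is an illustrative sketch, not an official template. The parameter names follow the open source hbase11xreader and hbase11xwriter plug-in documentation as commonly published. The helper name build_job, the table name, the column names, the source quorum, and the choice to treat every column as a string are assumptions that you must replace with your own values.

```python
import json


def build_job(source_quorum, target_endpoint, table, columns, channels=1):
    """Build a DataX job dict that copies columns of one HBase table.

    columns is a list of "family:qualifier" names (hypothetical schema).
    """
    # The reader emits the rowkey as field 0, then one field per column.
    reader_columns = [{"name": "rowkey", "type": "string"}]
    reader_columns += [{"name": c, "type": "string"} for c in columns]
    # The writer maps reader fields 1..n back to the same column names.
    writer_columns = [
        {"index": i, "name": c, "type": "string"}
        for i, c in enumerate(columns, start=1)
    ]
    return {
        "job": {
            "setting": {"speed": {"channel": channels}},
            "content": [{
                "reader": {
                    "name": "hbase11xreader",
                    "parameter": {
                        "hbaseConfig": {"hbase.zookeeper.quorum": source_quorum},
                        "table": table,
                        "encoding": "utf-8",
                        "mode": "normal",
                        "column": reader_columns,
                    },
                },
                "writer": {
                    "name": "hbase11xwriter",
                    "parameter": {
                        # As in the sample above, the cluster endpoint is
                        # passed in hbase.zookeeper.quorum.
                        "hbaseConfig": {"hbase.zookeeper.quorum": target_endpoint},
                        "table": table,
                        "encoding": "utf-8",
                        "mode": "normal",
                        "rowkeyColumn": [{"index": 0, "type": "string"}],
                        "column": writer_columns,
                    },
                },
            }],
        }
    }


job = build_job(
    source_quorum="zk1,zk2,zk3",  # hypothetical source cluster
    target_endpoint="ld-bp150tns0sjxs****-proxy-hbaseue.hbaseue.rds.aliyuncs.com:30020",
    table="demo_table",  # hypothetical table name
    columns=["cf:name", "cf:age"],  # hypothetical columns
)
print(json.dumps(job, indent=2))
```

Save the printed JSON to a file such as job.json and, from the directory in which you decompressed DataX, run python bin/datax.py job.json to start the task.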