ApsaraDB for HBase: Use DataWorks or DataX to import data

Last Updated: Mar 14, 2024

ApsaraDB for HBase Performance-enhanced Edition provides Lindorm Tunnel Service (LTS) for data migration and real-time data synchronization between ApsaraDB for HBase clusters of different versions. LTS can also synchronize data in real time from ApsaraDB RDS or LogHub to ApsaraDB for HBase. DataX is an offline data synchronization tool that is widely used within Alibaba Group. It synchronizes data between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, Hadoop Distributed File System (HDFS), Hive, AnalyticDB for MySQL, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Prerequisites

  • The version of your ApsaraDB for HBase cluster is 2.4.3 or later. For more information about how to view or update the cluster version, see Minor version updates.

  • The IP address of a client is added to the whitelist of the ApsaraDB for HBase cluster. For more information, see Configure a whitelist and a security group.
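
Before you edit the whitelist, it can help to check whether a client IP address is already covered by the entries you plan to use. The following is a minimal sketch that uses only the Python standard library. The function name `is_whitelisted` and all addresses are illustrative assumptions, not values taken from the console.

```python
import ipaddress


def is_whitelisted(client_ip, whitelist):
    """Return True if client_ip falls inside any whitelist entry.

    Entries may be single addresses ("10.0.0.8") or CIDR blocks
    ("192.168.0.0/24"). All values used here are hypothetical examples.
    """
    ip = ipaddress.ip_address(client_ip)
    # strict=False also accepts entries whose host bits are set, e.g. "10.0.0.5/24".
    return any(
        ip in ipaddress.ip_network(entry, strict=False) for entry in whitelist
    )


if __name__ == "__main__":
    entries = ["192.168.0.0/24", "10.0.0.8"]
    for candidate in ("192.168.0.42", "172.16.1.1"):
        print(candidate, is_whitelisted(candidate, entries))
```

This check runs locally and does not query the cluster. It only tells you whether the address matches the entries you pass in.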

Usage notes

  • An ApsaraDB for HBase cluster can be accessed only over a virtual private cloud (VPC).

    Important

    To access the ApsaraDB for HBase cluster over the Internet, update the SDK before you perform the data import operation. For more information, see Upgrade ApsaraDB for HBase SDK for Java.

  • If applications are deployed on an Elastic Compute Service (ECS) instance, and you want to access an ApsaraDB for HBase Performance-enhanced Edition cluster over a VPC, make sure that the ApsaraDB for HBase Performance-enhanced Edition cluster and the ECS instance meet the following requirements to ensure network connectivity:

    • The ApsaraDB for HBase cluster and the ECS instance are deployed in the same region. We recommend that you deploy the cluster and the instance in the same zone to reduce network latency.

    • The ApsaraDB for HBase cluster and the ECS instance belong to the same VPC.
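
To confirm network connectivity from the ECS instance, you can try to open a TCP connection to the cluster endpoint. The following is a minimal sketch, assuming the endpoint has the `host:port` form used later in this topic. The helper names `parse_endpoint` and `can_connect` are hypothetical, and a successful connection shows only that the port is reachable, not that authentication will succeed.

```python
import socket
import sys


def parse_endpoint(endpoint):
    """Split "host:port" into (host, port); the port must be an integer."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {endpoint!r}")
    return host, int(port)


def can_connect(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint succeeds within timeout."""
    host, port = parse_endpoint(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Pass one or more endpoints on the command line, e.g. "host:30020".
    for endpoint in sys.argv[1:]:
        print(endpoint, can_connect(endpoint))
```

Run the script on the ECS instance itself, because the result depends on the network from which the connection is attempted.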

Use DataX to synchronize data

You can use one of the following methods to configure DataX synchronization tasks:

  • Use the Data Integration service provided by Alibaba Cloud DataWorks to configure synchronization tasks in DataX. We recommend that you use this method.

  • Use the open source DataX to configure synchronization tasks.

Use DataWorks to configure synchronization tasks in DataX

  1. Create a workspace. For more information, see Create a workspace.

  2. Create a resource group. The following table describes different types of resource groups. We recommend that you use the resources from an exclusive resource group or a custom resource group to access an ApsaraDB for HBase cluster to synchronize data.

    | Resource group type | References | Characteristic | Remarks |
    | --- | --- | --- | --- |
    | Exclusive resource group | Exclusive resource group mode | DataWorks automatically subscribes to and manages resources in an exclusive resource group. This ensures high service performance and availability. | Exclusive resources cannot be shared across regions. For example, the exclusive resources in the China (Shanghai) region can be used by the workspace only in the China (Shanghai) region. The resources cannot be allocated to VPCs in other regions. You can use resources from an exclusive resource group to access ApsaraDB for HBase clusters that are attached to the same vSwitch. |
    | Custom resource group | Create and use a custom resource group for Data Integration | Only DataWorks Enterprise Edition and more advanced editions support custom resource groups. ECS instances that belong to a custom resource group can be purchased based on your business requirements. You can deploy ECS instances in the VPC in which the ApsaraDB for HBase cluster resides to access the ApsaraDB for HBase cluster over the VPC. If the ECS instances and the ApsaraDB for HBase cluster are not deployed in the same VPC, you can access the ApsaraDB for HBase cluster only over the Internet. | You have all permissions on instances that belong to the custom resource group. If you want to use DataX to connect to or manage the ECS instances, you must install, manage, maintain, and update DataX based on your business requirements. |
    | Default resource group | None | You can use ECS instances that belong to the default resource group to access the ApsaraDB for HBase cluster only over the Internet. | If you use DataWorks to access the ApsaraDB for HBase cluster over the Internet, excess costs are incurred. |

  3. Configure the network for the exclusive resource group, custom resource group, or default resource group.

    • Configure the network for the exclusive resource group.

      1. Associate the exclusive resource group with the VPC in which the ApsaraDB for HBase cluster is deployed. For more information, see Exclusive resource group mode.

      2. In the VPC console, look up the IPv4 CIDR block of the VPC with which the exclusive resource group is associated and the IPv4 CIDR block of the vSwitch. Add these CIDR blocks to the whitelist of the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configure a whitelist.

    • Configure the network for the custom resource group.

      Because you purchase the ECS instances in a custom resource group yourself, the exact IP address of each instance is known. Add all of these IP addresses to the whitelist of the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configure a whitelist.

    • Configure the network for the default resource group.

      For more information about the CIDR blocks of the instances in the default resource group, see step 5 in the "Add the IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides to the whitelist of a data source" section of the Configure an IP address whitelist topic. Add the CIDR blocks to the whitelist of the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configure a whitelist.

  4. Create a synchronization task and associate the instances with the resource group.

    1. Create a synchronization task. For more information, see Configure a batch synchronization task by using the codeless UI.

    2. Modify the configuration of HBase Writer and HBase Reader. The HBase Reader plug-in is used to read data from ApsaraDB for HBase clusters. The HBase Writer plug-in is used to write data to ApsaraDB for HBase clusters. For more information about the HBase Reader and HBase Writer plug-ins, see HBase Writer and HBase Reader.

      For more information about how to configure the plug-ins, see the help documentation for the plug-ins. When you configure the hbaseConfig section for an ApsaraDB for HBase Performance-enhanced Edition cluster, you must set the hbase.client.endpoint parameter instead of the hbase.zookeeper.quorum parameter. The following sample code shows how to modify the configuration.

      "hbaseConfig": {
        "hbase.client.connection.impl" : "com.alibaba.hbase.client.AliHBaseUEConnection",
        "hbase.client.endpoint" : "host:30020",
        "hbase.client.username" : "testuser",
        "hbase.client.password" : "password"
      }
      Note
      • The value of the hbase.client.connection.impl parameter is fixed.

      • The hbase.client.endpoint parameter specifies the Java API endpoint provided in the ApsaraDB for HBase console. You can use the Java API endpoint to access an ApsaraDB for HBase Performance-enhanced Edition cluster. For more information about how to obtain the Java API endpoint, see Access a cluster.

      • The hbase.client.username and hbase.client.password parameters specify the user-defined username and password. Make sure that the account has read and write permissions on the ApsaraDB for HBase Performance-enhanced Edition tables. The default username and the default password are both root. This account has read and write permissions on all tables.

      • Select ApsaraDB for HBase V1.1.x as the ApsaraDB for HBase version.
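
If you generate task configurations with a script, the hbaseConfig fragment described in the preceding notes can be built and checked programmatically. The following is a minimal sketch, not part of DataWorks or DataX. The function name `build_hbase_config` is hypothetical, and the endpoint, username, and password are the placeholder values from the sample above.

```python
import json

# The notes above state that this value is fixed.
FIXED_CONNECTION_IMPL = "com.alibaba.hbase.client.AliHBaseUEConnection"


def build_hbase_config(endpoint, username, password):
    """Build the hbaseConfig fragment for a Performance-enhanced Edition cluster."""
    host, sep, port = endpoint.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"endpoint must look like host:port, got {endpoint!r}")
    return {
        "hbase.client.connection.impl": FIXED_CONNECTION_IMPL,
        # The endpoint replaces hbase.zookeeper.quorum in this configuration.
        "hbase.client.endpoint": endpoint,
        "hbase.client.username": username,
        "hbase.client.password": password,
    }


if __name__ == "__main__":
    config = build_hbase_config("host:30020", "testuser", "password")
    print(json.dumps({"hbaseConfig": config}, indent=2))
```

The printed JSON can be pasted into the reader or writer configuration. Replace the placeholder values with the Java API endpoint and account of your own cluster.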

Use the open source DataX to configure synchronization tasks

  1. Download the DataX installation package from the official website of DataX and decompress it.

  2. Edit the hbase11xreader and hbase11xwriter plug-ins.

    In DataX, the hbase11xreader plug-in is used to read data from the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information about how to configure the plug-in, see Configuration examples. The hbase11xwriter plug-in is used to write data to the ApsaraDB for HBase Performance-enhanced Edition cluster. For more information, see Configuration examples. Sample configuration:

    ...
    "hbaseConfig": {
      // The endpoint (VPC endpoint) of the cluster.
      "hbase.zookeeper.quorum": "ld-bp150tns0sjxs****-proxy-hbaseue.hbaseue.rds.aliyuncs.com:30020"
    }
    ...
                        
  3. Use DataX to migrate data. For more information about how to use DataX, see Official DataX documentation.
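
The steps above can be sketched as a small script that writes a complete job file for the open source DataX. This is an illustrative sketch under stated assumptions, not an official template: the table name, column names, and output file name are hypothetical, the job pairs hbase11xreader with the streamwriter plug-in to print rows to the console, and the endpoint is the masked value from the sample above, which you must replace with your own.

```python
import json


def build_job(endpoint, table, columns):
    """Build a DataX job that reads an HBase table and prints the rows."""
    reader = {
        "name": "hbase11xreader",
        "parameter": {
            # As in the sample above, the cluster endpoint is set in
            # hbase.zookeeper.quorum for the open source plug-in.
            "hbaseConfig": {"hbase.zookeeper.quorum": endpoint},
            "table": table,
            "encoding": "utf-8",
            "mode": "normal",
            "column": [{"name": name, "type": "string"} for name in columns],
        },
    }
    writer = {"name": "streamwriter", "parameter": {"print": True}}
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


if __name__ == "__main__":
    # Masked endpoint copied from the sample; replace it before running DataX.
    endpoint = "ld-bp150tns0sjxs****-proxy-hbaseue.hbaseue.rds.aliyuncs.com:30020"
    # "test_table" and "cf:col1" are hypothetical names.
    job = build_job(endpoint, "test_table", ["rowkey", "cf:col1"])
    with open("hbase_job.json", "w", encoding="utf-8") as f:
        json.dump(job, f, indent=2)
    print("Run from the DataX directory: python bin/datax.py hbase_job.json")
```

Unlike the sample above, the generated file is strict JSON and contains no `//` comments.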