ApsaraDB for HBase provides Big Datahub Service (BDS) for scenarios such as data migration and real-time data synchronization between HBase versions. BDS can also synchronize real-time data from Relational Database Service (RDS) and LogHub to ApsaraDB for HBase. For more information, see BDS introduction.

BDS does not support importing data from heterogeneous data sources such as MaxCompute (formerly ODPS). To import data from these data sources, you must use DataX. DataX is an offline data synchronization tool that is widely used within Alibaba Group. It synchronizes data between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Before you migrate the data, you must add the IP addresses of the ECS instances that run DataX to the ApsaraDB for HBase whitelist.
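Before you start a synchronization task, it can help to confirm that the address of each ECS instance is covered by the entries you plan to add to the whitelist. The following is a minimal sketch that uses only the Python standard library. The helper name is_whitelisted and all addresses are hypothetical examples, and the sketch assumes that whitelist entries are either single IP addresses or CIDR blocks.

```python
import ipaddress


def is_whitelisted(ip, whitelist):
    """Return True if `ip` is covered by any entry in `whitelist`.

    Each entry may be a single address ("172.16.5.20") or a
    CIDR block ("192.168.0.0/24").
    """
    addr = ipaddress.ip_address(ip)
    # strict=False also accepts entries whose host bits are set,
    # such as "192.168.0.10/24".
    return any(
        addr in ipaddress.ip_network(entry, strict=False)
        for entry in whitelist
    )


# Hypothetical values for illustration only.
whitelist = ["192.168.0.0/24", "172.16.5.20"]

print(is_whitelisted("192.168.0.37", whitelist))  # True
print(is_whitelisted("10.1.2.3", whitelist))      # False
```

This check runs locally and does not query ApsaraDB for HBase. It only compares addresses against the list that you pass in.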

Use DataX for data synchronization

You can use one of the following methods to configure DataX synchronization tasks:
  • Use the Data Integration service provided by Alibaba Cloud DataWorks to configure synchronization tasks in DataX.
  • Use the open source edition of DataX to configure synchronization tasks.

Use DataWorks to specify the parameters of DataX

  1. Create a workspace. For more information, see Create a workspace.
  2. Create a resource group.

    We recommend that you use the resources in the exclusive resource group or the custom resource group to connect to ApsaraDB for HBase for data synchronization.

    Type: Exclusive resource group
      • Configuration guide: Exclusive resource group mode
      • Feature: DataWorks automatically subscribes to and manages exclusive resources. This guarantees service performance and availability.
      • Remarks: Exclusive resources cannot be shared among regions. For example, the exclusive resources in the China (Shanghai) region can be used only by workspaces in the China (Shanghai) region. The resources cannot be bound to a Virtual Private Cloud (VPC) network in another region. You can use exclusive resources to connect only to ApsaraDB for HBase clusters attached to the same Virtual Switch (VSwitch).

    Type: Custom resource group
      • Configuration guide: Create a custom resource group for Data Integration
      • Feature: Only DataWorks Enterprise Edition and later support custom resource groups. You can purchase the Elastic Compute Service (ECS) instances that belong to a custom resource group based on your needs. If you deploy the ECS instances in the VPC network, they can access ApsaraDB for HBase over the internal network. Otherwise, they can access ApsaraDB for HBase only over a public network.
      • Remarks: You have all permissions on the instances that belong to the custom resource group, so you can log on to and manage them. You must install, manage, maintain, and upgrade DataX yourself as needed. For more information, see Custom resource group.

    Type: Default resource group
      • Configuration guide: N/A
      • Feature: Instances that belong to the default resource group can access ApsaraDB for HBase in the VPC network only over a public network, not over the internal network.
      • Remarks: If you access ApsaraDB for HBase over a public network, additional costs are incurred for DataWorks.
  3. Configure network parameters
    • Configure the network for the exclusive resource group
      1. Bind the exclusive resource group to the VPC network of ApsaraDB for HBase. For more information, see Exclusive resource group mode.
      2. In the VPC console, check the CIDR blocks of the VPC and the VSwitch that the exclusive resource group is bound to.
      3. Add the CIDR block to the ApsaraDB for HBase whitelist.
    • Configure the network for the custom resource group

      If you know the exact IP address of each ECS instance in the custom resource group, add all the IP addresses to the ApsaraDB for HBase whitelist.

    • Configure the network for the default resource group

      Retrieve the CIDR block of the instance in the default resource group.

  4. Create a synchronization task and bind a resource group
    1. Create a synchronization task. For more information, see Create a sync node by using the codeless UI.
    2. Modify the plug-in configurations. Use the hbase11xwriter plug-in for writes and the hbase11xreader plug-in for reads. For more information, see HBase Writer and Configure HBase Reader.
      When you configure the hbaseConfig parameter for ApsaraDB for HBase Performance-enhanced Edition, you must specify the endpoint parameter (hbase.client.endpoint) instead of the ZooKeeper quorum parameter (hbase.zookeeper.quorum). The following sample code shows how to configure the connection:

          "hbaseConfig": {
              "hbase.client.connection.impl": "com.alibaba.hbase.client.AliHBaseUEConnection",
              "hbase.client.endpoint": "host:30020",
              "hbase.client.username": "root",
              "hbase.client.password": "root"
          }
              
      Note
      • hbase.client.connection.impl: a fixed configuration. You do not need to change it.
      • hbase.client.endpoint: the endpoint used for the Java API. For more information, see Connect to a cluster.
      • hbase.client.username and hbase.client.password: specify the user-defined username and password. Make sure that the account has the read and write permissions on ApsaraDB for HBase Performance-enhanced Edition tables. The default username and password are root. This account has the read and write permissions on all tables.
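The hbaseConfig fragment can also be generated and checked with a small script before you paste it into a synchronization task. The following is a minimal sketch, not part of DataWorks or DataX. The function name build_hbase_config and the host:port check are assumptions made for illustration, and "host:30020" is the placeholder endpoint from the sample above.

```python
import json
import re


def build_hbase_config(endpoint, username="root", password="root"):
    """Build the hbaseConfig section for Performance-enhanced Edition.

    The result contains hbase.client.endpoint and deliberately omits
    any ZooKeeper quorum setting.
    """
    # Assumption: the endpoint has the form host:port, such as "host:30020".
    if not re.fullmatch(r"[^\s:]+:\d+", endpoint):
        raise ValueError(f"endpoint must look like host:port, got {endpoint!r}")
    return {
        # Fixed value. You do not need to change it.
        "hbase.client.connection.impl": "com.alibaba.hbase.client.AliHBaseUEConnection",
        "hbase.client.endpoint": endpoint,
        "hbase.client.username": username,
        "hbase.client.password": password,
    }


config = build_hbase_config("host:30020")
print(json.dumps({"hbaseConfig": config}, indent=2))
```

Replace the placeholder endpoint, username, and password with the values of your own cluster. Keeping the fixed connection class in one place avoids typing errors in the class name.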

Use the open-source DataX to specify the parameters

  1. Download the DataX installation package

    Click here.

    If you have already installed DataX or downloaded the latest DataX version from GitHub, you do not need to download the entire compressed file. Download only the latest JAR file of alihbase-connector version 1.x from the Plug-ins used to connect to ApsaraDB for HBase Performance-enhanced Edition section in Install the SDK for Java. To write data, save the JAR file in the datax/plugin/writer/hbase11xwriter/libs directory. To use DataX to read data from ApsaraDB for HBase Performance-enhanced Edition, save the JAR file in the datax/plugin/reader/hbase11xreader/libs directory.

  2. Modify the configuration file

    In DataX, the hbase11xreader plug-in is used to read data from ApsaraDB for HBase Performance-enhanced Edition. For more information, see Document. The hbase11xwriter plug-in is used to write data to ApsaraDB for HBase Performance-enhanced Edition. For more information, see Document. The read and write configurations for ApsaraDB for HBase Performance-enhanced Edition are the same as the official configurations, except for hbaseConfig. When you configure hbaseConfig for ApsaraDB for HBase Performance-enhanced Edition, you must specify the endpoint parameter (hbase.client.endpoint) instead of the ZooKeeper quorum parameter (hbase.zookeeper.quorum). The following sample code shows how to configure the connection:

    
        ...
        "hbaseConfig": {
            "hbase.client.connection.impl": "com.alibaba.hbase.client.AliHBaseUEConnection",
            "hbase.client.endpoint": "host:30020",
            "hbase.client.username": "root",
            "hbase.client.password": "root"
        }
        ...
         

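    Set the hbase.client.connection.impl parameter to com.alibaba.hbase.client.AliHBaseUEConnection to use the Connection object of ApsaraDB for HBase Performance-enhanced Edition. The hbase.client.endpoint parameter specifies the endpoint used for the Java API. The hbase.client.username and hbase.client.password parameters specify an account that has the required read and write permissions on the tables.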

  3. Launch DataX to migrate data

    For more information, see the official DataX documentation.
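The open source steps can be sketched end to end as follows: build a job file that embeds hbaseConfig, save it, and assemble the launch command. This is a hedged illustration, not an official template. The table name demo_table, the column cf1:q1, the streamreader test source, and the datax installation directory are hypothetical. The writer field names (table, mode, rowkeyColumn, column, encoding) follow the open source hbase11xwriter plug-in as commonly documented, so verify them against the plug-in documentation for your DataX version.

```python
import json
import os
import tempfile


def build_job(hbase_config, table):
    """Return a DataX job dict that writes one generated row to HBase."""
    reader = {
        # streamreader generates test records in memory.
        "name": "streamreader",
        "parameter": {
            "column": [
                {"value": "row1", "type": "string"},   # becomes the rowkey
                {"value": "hello", "type": "string"},  # becomes cf1:q1
            ],
            "sliceRecordCount": 1,
        },
    }
    writer = {
        "name": "hbase11xwriter",
        "parameter": {
            "hbaseConfig": hbase_config,
            "table": table,
            "mode": "normal",
            "rowkeyColumn": [{"index": 0, "type": "string"}],
            "column": [{"index": 1, "name": "cf1:q1", "type": "string"}],
            "encoding": "utf-8",
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


def build_command(datax_home, job_path):
    """Return the command that launches DataX with the given job file."""
    return ["python", os.path.join(datax_home, "bin", "datax.py"), job_path]


# Placeholder connection values taken from the sample configuration.
hbase_config = {
    "hbase.client.connection.impl": "com.alibaba.hbase.client.AliHBaseUEConnection",
    "hbase.client.endpoint": "host:30020",
    "hbase.client.username": "root",
    "hbase.client.password": "root",
}

job = build_job(hbase_config, table="demo_table")
job_path = os.path.join(tempfile.gettempdir(), "hbase_job.json")
with open(job_path, "w", encoding="utf-8") as f:
    json.dump(job, f, indent=2)

# Print the command instead of running it, because DataX may not be installed here.
print(" ".join(build_command("datax", job_path)))
```

Run the printed command on the ECS instance where DataX is installed. The sketch only writes the job file and prints the command, so it does not connect to any cluster by itself.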