ApsaraDB for HBase provides Big Datahub Service (BDS) for scenarios such as data migration and real-time data synchronization between HBase versions. BDS can also synchronize real-time data from Relational Database Service (RDS) and LogHub to ApsaraDB for HBase. For more information, see BDS introduction.

BDS does not support importing data from heterogeneous data sources such as MaxCompute (formerly ODPS). To import data from these data sources, you must use DataX. DataX is an offline data synchronization tool that is widely used within Alibaba Group. It synchronizes data between heterogeneous data sources such as MySQL, Oracle, SQL Server, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Use DataX for data synchronization

You can use one of the following methods to configure DataX synchronization tasks:

  1. Use the Data Integration service provided by Alibaba Cloud DataWorks to configure synchronization tasks in DataX.
  2. Use the open source edition of DataX to configure synchronization tasks.

Method 1: Use DataWorks to specify the parameters of DataX.

Create a workspace

For more information, see Create a workspace.

Create a resource group

Exclusive resource group
  Configuration guide: Exclusive resource mode
  Feature: DataWorks automatically subscribes to and manages exclusive resources. This guarantees service performance and availability.
  Remarks: Exclusive resources cannot be shared among regions. For example, the exclusive resources in the China (Shanghai) region can be used only by workspaces in the China (Shanghai) region. The resources cannot be bound to a Virtual Private Cloud (VPC) network in another region. You can use exclusive resources to connect only to ApsaraDB for HBase clusters attached to the same Virtual Switch (VSwitch).

Custom resource group
  Configuration guide: Custom resource group
  Feature: Only DataWorks Enterprise Edition and later support custom resource groups. You can purchase the Elastic Compute Service (ECS) instances that belong to a custom resource group based on your needs. If you deploy the ECS instances in the VPC network, they can access ApsaraDB for HBase over the internal network. Otherwise, they can access ApsaraDB for HBase only over a public network.
  Remarks: You have all permissions on the instances that belong to the custom resource group, and you can log on to and manage them. You must install, manage, maintain, and upgrade DataX on these instances yourself. For more information, see Custom resource group.

Default resource group
  Configuration guide: N/A
  Feature: Instances that belong to the default resource group can access ApsaraDB for HBase in the VPC network only over a public network, not over the internal network.
  Remarks: If you access ApsaraDB for HBase over a public network, additional costs are incurred for DataWorks. For more information, see Pay-as-you-go.

We recommend that you use the resources in the exclusive resource group or the custom resource group to connect to ApsaraDB for HBase for data synchronization. For more information, see Custom resource group.

Configure network parameters

Configure the network for the exclusive resource group

  1. Bind the exclusive resource group to the VPC network of ApsaraDB for HBase. For more information, see Bind a VPC network.
  2. In the VPC console, check the CIDR block of the VPC and VSwitch to which the exclusive resource group is bound. In this example, the CIDR block is 192.168.0.0/24. If you do not know the exact IP address of each instance in the exclusive resource group, add the entire CIDR block to the ApsaraDB for HBase whitelist before you connect to ApsaraDB for HBase.
  3. Add the CIDR block to the ApsaraDB for HBase whitelist. For more information, see Configure a whitelist.
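
Whitelisting a CIDR block rather than single addresses covers every instance whose address falls inside that block. The following minimal sketch uses the Python standard library to check this for the sample block 192.168.0.0/24. The two candidate addresses are hypothetical and are used only for illustration.

```python
import ipaddress


def is_covered(address: str, cidr_block: str) -> bool:
    """Return True if the address falls inside the whitelisted CIDR block."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr_block)


# 192.168.0.0/24 is the sample VSwitch CIDR block used in step 2.
block = "192.168.0.0/24"

# Hypothetical instance addresses, for illustration only.
for candidate in ("192.168.0.17", "192.168.1.5"):
    print(candidate, is_covered(candidate, block))
```

An address outside the block, such as the second candidate, would not be admitted by a whitelist entry for 192.168.0.0/24.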

Configure the network for the custom resource group

If you know the exact IP address of each ECS instance in the custom resource group, add all the IP addresses to the ApsaraDB for HBase whitelist. For more information, see Configure a whitelist.

Configure the network for the default resource group

  1. Retrieve the CIDR block of the instances in the default resource group for your region. For more information, see Add whitelist.
  2. Add the CIDR block of the region to the ApsaraDB for HBase whitelist. For more information, see Configure a whitelist.

Create a synchronization task and bind a resource group

  1. Create a synchronization task. For more information, see Configure a data synchronization node by using the codeless UI.
  2. Modify the plug-in configurations. Use the hbase11xwriter plug-in for writes and the hbase11xreader plug-in for reads.

For more information about the plug-in parameters, see the help documentation. When you configure the hbaseConfig parameter for ApsaraDB for HBase Performance-enhanced Edition, you must specify the endpoint parameter instead of the ZooKeeper quorum parameter (hbase.zookeeper.quorum). The following sample code shows how to configure the connection.

"hbaseConfig": {
  "hbase.client.connection.impl" : "com.alibaba.hbase.client.AliHBaseUEConnection",
  "hbase.client.endpoint" : "host:30020",
  "hbase.client.username" : "root",
  "hbase.client.password" : "root"
}

Notes:

  • The hbase.client.connection.impl parameter is a fixed configuration. You do not need to change it.
  • The hbase.client.endpoint parameter specifies the endpoint used for the Java API. For more information, see Connect to a cluster.
  • The hbase.client.username and hbase.client.password parameters specify the user-defined username and password. Make sure that the account has the read and write permissions on ApsaraDB for HBase Performance-enhanced Edition tables. The default username and password are root. This account has the read and write permissions on all tables. For more information about users and ACLs, see Connect to a cluster.

3. Choose the resource group that you have created as the task resource.
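
Before you submit the task, it can help to sanity-check the hbaseConfig block. The following sketch is a hypothetical helper, not part of DataWorks or DataX. It assumes that all four keys from the sample are required and that the ZooKeeper setting to avoid is the standard HBase client key hbase.zookeeper.quorum.

```python
# Keys taken from the sample hbaseConfig; treating all four as required is an assumption.
REQUIRED_KEYS = (
    "hbase.client.connection.impl",
    "hbase.client.endpoint",
    "hbase.client.username",
    "hbase.client.password",
)
# Assumption: the ZooKeeper setting to avoid is the standard HBase client key.
ZOOKEEPER_KEY = "hbase.zookeeper.quorum"


def check_hbase_config(hbase_config: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not hbase_config.get(key):
            problems.append("missing " + key)
    if ZOOKEEPER_KEY in hbase_config:
        problems.append("remove " + ZOOKEEPER_KEY + "; use hbase.client.endpoint")
    endpoint = hbase_config.get("hbase.client.endpoint", "")
    host, sep, port = endpoint.rpartition(":")
    if endpoint and not (host and sep and port.isdigit()):
        problems.append("endpoint should look like host:port")
    return problems


# The sample values from this topic; replace host with your own endpoint.
sample = {
    "hbase.client.connection.impl": "com.alibaba.hbase.client.AliHBaseUEConnection",
    "hbase.client.endpoint": "host:30020",
    "hbase.client.username": "root",
    "hbase.client.password": "root",
}
print(check_hbase_config(sample))
```

The check only inspects the shape of the configuration. It does not contact the cluster, so a configuration that passes can still fail if the whitelist or credentials are wrong.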

Method 2: Use the open-source DataX to specify the parameters.

Download the DataX installation package

Download and decompress the TAR file of DataX. This TAR file includes the JAR file that is required to access ApsaraDB for HBase Performance-enhanced Edition.

If you have installed DataX or downloaded the latest DataX version from GitHub, you must add the required JAR file. Follow these steps:

  1. Download the latest JAR file of alihbase-connector version 1.x from the Plug-ins used to connect to ApsaraDB for HBase Performance-enhanced Edition section in Install the SDK for Java. You need only the JAR file, not the entire compressed file.
  2. Save the JAR file in the datax/plugin/writer/hbase11xwriter/libs directory.
  3. If you want to use DataX to read data from ApsaraDB for HBase Performance-enhanced Edition, also save the JAR file in the datax/plugin/reader/hbase11xreader/libs directory.
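
The copy into the two plug-in directories can be scripted. The following sketch is one possible way to do it with the Python standard library. The install_connector function is a hypothetical helper, and it assumes that datax_home points to the decompressed datax directory that already contains the plug-in libs directories named above.

```python
import shutil
from pathlib import Path

# Plug-in library directories named in this topic, relative to the datax directory.
PLUGIN_LIB_DIRS = (
    "plugin/writer/hbase11xwriter/libs",
    "plugin/reader/hbase11xreader/libs",
)


def install_connector(jar_path, datax_home):
    """Copy the connector JAR into the writer and reader libs directories."""
    jar = Path(jar_path)
    copied = []
    for relative in PLUGIN_LIB_DIRS:
        target_dir = Path(datax_home) / relative
        if not target_dir.is_dir():
            # A missing directory usually means datax_home is wrong, so fail loudly.
            raise FileNotFoundError("plug-in directory not found: " + str(target_dir))
        copied.append(Path(shutil.copy2(jar, target_dir)))
    return copied
```

If you only write to ApsaraDB for HBase, the copy into the reader directory is harmless but not required.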

Modify the configuration file

In DataX, the hbase11xreader plug-in reads data from ApsaraDB for HBase Performance-enhanced Edition, and the hbase11xwriter plug-in writes data to it. For more information, see the documentation of each plug-in.

The read and write configurations for ApsaraDB for HBase Performance-enhanced Edition are the same as the official configurations, except for hbaseConfig. When you configure hbaseConfig for ApsaraDB for HBase Performance-enhanced Edition, you must specify the endpoint parameter instead of the ZooKeeper quorum parameter (hbase.zookeeper.quorum). The following sample code shows how to configure the connection.

...
"hbaseConfig": {
  "hbase.client.connection.impl" : "com.alibaba.hbase.client.AliHBaseUEConnection",
  "hbase.client.endpoint" : "host:30020",
  "hbase.client.username" : "root",
  "hbase.client.password" : "root"
}
...

Notes:

  • Set the hbase.client.connection.impl parameter to com.alibaba.hbase.client.AliHBaseUEConnection to use the Connection object of ApsaraDB for HBase Performance-enhanced Edition.
  • The hbase.client.endpoint parameter specifies the endpoint used for the Java API. For more information, see Connect to a cluster.
  • The hbase.client.username and hbase.client.password parameters specify the user-defined username and password. Make sure that the account has the read and write permissions on ApsaraDB for HBase Performance-enhanced Edition tables. The default username and password are root. This account has the read and write permissions on all tables. For more information about users and ACLs, see Connect to a cluster.
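
To show where hbaseConfig sits in a complete job file, the following sketch assembles a small open-source DataX job in Python and prints it as JSON. It is an illustration under stated assumptions, not a definitive configuration: the build_job helper, the table name example_table, the column family and qualifier cf1:q1, and the streamreader test data are all hypothetical, and host:30020 is the placeholder endpoint from the sample above.

```python
import json


def build_job(endpoint, username, password, table):
    """Assemble a DataX job dict that writes test rows to an HBase table."""
    hbase_config = {
        "hbase.client.connection.impl": "com.alibaba.hbase.client.AliHBaseUEConnection",
        "hbase.client.endpoint": endpoint,  # endpoint replaces the ZooKeeper quorum
        "hbase.client.username": username,
        "hbase.client.password": password,
    }
    # streamreader generates constant test records; the values are hypothetical.
    reader = {
        "name": "streamreader",
        "parameter": {
            "sliceRecordCount": 10,
            "column": [
                {"type": "string", "value": "row1"},
                {"type": "string", "value": "hello"},
            ],
        },
    }
    writer = {
        "name": "hbase11xwriter",
        "parameter": {
            "hbaseConfig": hbase_config,
            "table": table,
            "mode": "normal",
            # Source column 0 becomes the rowkey; column 1 goes to cf1:q1 (hypothetical).
            "rowkeyColumn": [{"index": 0, "type": "string"}],
            "column": [{"index": 1, "name": "cf1:q1", "type": "string"}],
            "encoding": "utf-8",
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


print(json.dumps(build_job("host:30020", "root", "root", "example_table"), indent=2))
```

Save the printed JSON as the job file and replace the reader with the plug-in for your actual data source. The target table and column family must already exist in ApsaraDB for HBase.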

Launch DataX to migrate data

For more information, see the official DataX documentation.

Note:

Before you migrate the data, you must add the IP addresses of the ECS instances to the whitelist. For more information, see Connect to a cluster. If the ECS instances and the ApsaraDB for HBase Performance-enhanced Edition instances are not in the same VPC, you must use the public endpoint.
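
Open-source DataX is started by passing the job file to the datax.py launcher in its bin directory. The following sketch builds that command line and optionally runs it. The datax_command and run_job helpers, the /opt/datax installation path, and the job.json file name are hypothetical, and the interpreter name python is an assumption that depends on which Python version your DataX release supports.

```python
import subprocess
from pathlib import Path


def datax_command(datax_home, job_file, interpreter="python"):
    """Build the command line that starts a DataX job."""
    launcher = Path(datax_home) / "bin" / "datax.py"
    return [interpreter, str(launcher), str(job_file)]


def run_job(datax_home, job_file):
    """Run the job and return the exit code of the DataX process."""
    return subprocess.run(datax_command(datax_home, job_file), check=False).returncode


# /opt/datax and job.json are hypothetical; this only prints the command.
print(" ".join(datax_command("/opt/datax", "job.json")))
```

Call run_job only on a machine whose IP address is already in the whitelist. Otherwise the job fails when it tries to connect.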