Use Data Integration in DataWorks to run an offline synchronization task that loads data from ApsaraDB for HBase into Alibaba Cloud Elasticsearch for search and analysis.
By the end of this tutorial, you will have:
- An exclusive resource group connected to the virtual private clouds (VPCs) of both your HBase and Elasticsearch clusters
- HBase and Elasticsearch registered as data sources in Data Integration
- A running offline sync task that writes HBase data into an Elasticsearch index
How it works
DataWorks is an end-to-end big data development and governance platform based on big data compute engines, featuring data development, task scheduling, and data management. Data Integration retrieves data from the HBase source and writes it to the Elasticsearch cluster through an exclusive resource group. The exclusive resource group connects to both clusters over their VPCs, ensuring stable and secure data transfer.
Synchronization runs in offline (batch) mode. For real-time synchronization, see Create a real-time synchronization task to synchronize data to Elasticsearch.
The following data sources are supported by DataWorks Data Integration:
- Alibaba Cloud databases: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for MongoDB, and ApsaraDB for HBase
- Alibaba Cloud services: PolarDB for Xscale (PolarDB-X) (formerly DRDS), MaxCompute, Object Storage Service (OSS), and Tablestore
- Self-managed sources: HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, MongoDB, and HBase
Prerequisites
Before you begin, ensure that you have:
- An ApsaraDB for HBase cluster. For setup instructions, see Purchase a cluster.
- An Alibaba Cloud Elasticsearch cluster with Auto Indexing enabled. To create a cluster, see Create an Alibaba Cloud Elasticsearch cluster. To enable Auto Indexing, see Configure the YML file.
- A DataWorks workspace. See Create a workspace.
Constraints:
- Only Alibaba Cloud Elasticsearch is supported as the sync destination. Self-managed Elasticsearch instances are not supported.
- The HBase cluster, Elasticsearch cluster, and DataWorks workspace must be in the same region and the same time zone. A time zone mismatch causes incorrect timestamps in synchronized time-related data.
Billing
- For Elasticsearch cluster pricing, see Elasticsearch billable items.
- For exclusive resource group pricing, see Billing of exclusive resource groups for Data Integration (subscription).
Step 1: Prepare source data
This tutorial uses a table named student with three column families: name, ID, and gender.
Create the table in HBase Shell:
create 'student', {NAME => 'name'}, {NAME => 'ID'}, {NAME => 'gender'}
Insert test data using put commands. For example:
put 'student', 'row1', 'name:a', 'xiaoming'
Verify the data with scan 'student'.
For more details on loading data, see Use HBase Shell to access.
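To give the sync task something to read from every column family, you can extend the single put above. The following HBase shell commands are a sketch; the row key, qualifiers, and values are illustrative, matching the single-qualifier-per-family layout used in this tutorial:

```
put 'student', 'row1', 'name:a', 'xiaoming'
put 'student', 'row1', 'ID:a', '001'
put 'student', 'row1', 'gender:a', 'male'
scan 'student'
```

The final scan should list one cell per column family for row1.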
Step 2: Purchase and configure an exclusive resource group
An exclusive resource group handles data transfer between your HBase cluster, Elasticsearch cluster, and DataWorks. It must be connected to the VPCs of both clusters.
- Log on to the DataWorks console.
- In the top navigation bar, select your region. In the left navigation pane, click Resource Group.
- On the Exclusive Resource Groups tab, click Create Legacy Resource Group > Data Integration Resource Group.
- On the DataWorks Exclusive Resources (Subscription) page, set Resource Type to Exclusive Resource Group For Data Integration, enter a name, and click Buy Now. For details, see Step 1: Create an exclusive resource group for Data Integration.
- In the Operation column for the new resource group, click Network Settings to attach a VPC. The exclusive resource group must connect to the VPCs of both the HBase and Elasticsearch clusters. Attach the resource group to the VPC, zone, and vSwitch of each cluster. To find the Elasticsearch cluster's VPC details, see View the basic information of an Elasticsearch instance.
  Important: After attaching a VPC, add the vSwitch CIDR block to the private IP address whitelists of both the HBase and Elasticsearch clusters. For the Elasticsearch cluster, see Configure a public or private IP address whitelist for an Elasticsearch cluster. For the exclusive resource group, see Configure an IP address whitelist.
- Click the back icon to return to the Resource Group List page.
- In the Operation column, click Attach Workspace to associate the resource group with your workspace. For details, see Step 2: Associate the exclusive resource group for Data Integration with a workspace.
Step 3: Add data sources
Register both HBase and Elasticsearch as data sources in Data Integration.
- Open Data Integration.
  - Log on to the DataWorks console.
  - In the left navigation pane, click Workspace.
  - In the Actions column for your workspace, choose Shortcuts > Data Integration.
- In the left navigation pane, click Data Source.
- Add the HBase data source.
  - On the Data Source List page, click Add Data Source.
  - In the Add Data Source dialog box, search for and select HBase.
  - In the Add HBase Data Source dialog box, configure the parameters in the Basic Information section. For parameter descriptions, see Configure an HBase data source.
  - In the Connection Configuration section, click Test Connectivity. A Connected status confirms a successful connection.
  - Click Complete.
- Add the Elasticsearch data source using the same steps. For details, see Add an Elasticsearch data source.
Step 4: Configure and run an offline sync task
This tutorial uses the codeless UI on the legacy Data Development (DataStudio) page. To use the code editor instead, see Configure a batch synchronization task using the code editor. For Elasticsearch-specific writer configuration, see Elasticsearch Writer.
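In the code editor, the same task is expressed as a JSON script with a reader step and a writer step. The following is a minimal sketch only, assuming an HBase 1.1-compatible reader; the ZooKeeper quorum, endpoint, credentials, index name, and column names are illustrative placeholders, and the authoritative parameter list is in the HBase data source and Elasticsearch Writer documentation:

```json
{
  "type": "job",
  "steps": [
    {
      "stepType": "hbase11x",
      "category": "reader",
      "name": "Reader",
      "parameter": {
        "mode": "normal",
        "table": "student",
        "encoding": "utf-8",
        "hbaseConfig": { "hbase.zookeeper.quorum": "<zk-host>:2181" },
        "column": [
          { "name": "rowkey", "type": "string" },
          { "name": "name:a", "type": "string" },
          { "name": "ID:a", "type": "string" },
          { "name": "gender:a", "type": "string" }
        ]
      }
    },
    {
      "stepType": "elasticsearch",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "endpoint": "http://<es-host>:9200",
        "accessId": "<username>",
        "accessKey": "<password>",
        "index": "student_info",
        "cleanup": false,
        "column": [
          { "name": "rowkey", "type": "id" },
          { "name": "name", "type": "text" },
          { "name": "ID", "type": "keyword" },
          { "name": "gender", "type": "keyword" }
        ]
      }
    }
  ]
}
```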
- Open Data Development.
  - Log on to the DataWorks console.
  - In the left navigation pane, click Workspace.
  - In the Actions column for your workspace, choose Shortcuts > Data Development.
- Create an offline sync task.
  - On the Data Development tab in the navigation pane, choose Create > Create Workflow and follow the prompts.
  - Right-click the workflow and choose Create Node > Data Integration > Offline Synchronization.
  - In the Create Node dialog box, enter a node name and click Confirm.
- Configure the network and resources.
  - In the Source section, set Source to HBase and Data Source Name to your HBase data source.
  - For Resource Group, select the exclusive resource group you created.
  - In the Destination section, set Destination to Elasticsearch and Data Source Name to your Elasticsearch data source.
- Click Next.
- Configure the task details.
  - In the Source section, select the table to synchronize.
  - Configure the parameters in the Destination section.
  - In the Field Mapping section, map Source Fields to Destination Fields. For details, see Configure a batch synchronization task using the codeless UI.
  - Configure the channel parameters in the Channel Control section.
- Save and run the task.
  - (Optional) Click Scheduling Configuration on the right panel to set up a recurring schedule. See Scheduling configuration. If you configure a schedule, the task runs periodically.
  - Click the Save icon above the node editor.
  - Click the Submit icon to submit the task.
  - To run the task immediately, click the Run icon in the upper-right corner of the node editor. When the operational log shows Shell run successfully!, the task has completed.
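Conceptually, the field mapping you configured turns each HBase row into one Elasticsearch document: the row key becomes the document ID, and each mapped column becomes a field. The sketch below is illustrative only (the actual transformation is performed by the sync task, not by your code) and assumes this tutorial's one-qualifier-per-family layout:

```python
def hbase_row_to_es_doc(row_key, cells):
    """Flatten HBase cells ({'family:qualifier': value}) into a flat
    Elasticsearch document, using the column family as the field name."""
    doc = {"_id": row_key}
    for column, value in cells.items():
        family, _, qualifier = column.partition(":")
        doc[family] = value
    return doc

# One HBase row from Step 1 becomes one Elasticsearch document.
doc = hbase_row_to_es_doc(
    "row1", {"name:a": "xiaoming", "ID:a": "001", "gender:a": "male"}
)
```

With multiple qualifiers per family, you would instead map each `family:qualifier` pair to its own destination field in the Field Mapping section.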
Step 5: Verify the data synchronization result
- Log on to the Kibana console. For instructions, see Log on to the Kibana console.
  Note: This tutorial uses an Elasticsearch V7.10.0 cluster. Steps may differ for other versions. The actual operations in the console prevail.
- In the upper-right corner, click Dev tools.
- On the Console tab, run the following query to view the synchronized data:
  POST /student_info/_search?pretty
  { "query": { "match_all": {} } }
  Note: student_info is the index name that you set for the destination in the offline sync task.
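If the synchronization succeeded, the response lists the HBase rows as documents. The shape is roughly as follows (an illustrative Elasticsearch 7.x response fragment; the document fields depend on your field mapping):

```json
{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_index": "student_info",
        "_id": "row1",
        "_source": { "name": "xiaoming", "ID": "001", "gender": "male" }
      }
    ]
  }
}
```

An empty hits array usually means the task wrote to a different index name or has not finished running.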