Use Data Integration in DataWorks to run an offline synchronization task that loads data from ApsaraDB for HBase into Alibaba Cloud Elasticsearch for search and analysis.
By the end of this tutorial, you will have:
- An exclusive resource group connected to the virtual private clouds (VPCs) of both your HBase and Elasticsearch clusters
- HBase and Elasticsearch registered as data sources in Data Integration
- A running offline sync task that writes HBase data into an Elasticsearch index
How it works
DataWorks is an end-to-end big data development and governance platform based on big data compute engines, featuring data development, task scheduling, and data management. Data Integration retrieves data from the HBase source and writes it to the Elasticsearch cluster through an exclusive resource group. The exclusive resource group connects to both clusters over their VPCs, ensuring stable and secure data transfer.
Synchronization runs in offline (batch) mode. For real-time synchronization, see Create a real-time synchronization task to synchronize data to Elasticsearch.
The following data sources are supported by DataWorks Data Integration:
- Alibaba Cloud databases: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for MongoDB, and ApsaraDB for HBase
- Alibaba Cloud services: PolarDB for Xscale (PolarDB-X) (formerly DRDS), MaxCompute, Object Storage Service (OSS), and Tablestore
- Self-managed sources: HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, MongoDB, and HBase
Prerequisites
Before you begin, ensure that you have:
- An ApsaraDB for HBase cluster. For setup instructions, see Purchase a cluster.
- An Alibaba Cloud Elasticsearch cluster with Auto Indexing enabled. To create a cluster, see Create an Alibaba Cloud Elasticsearch cluster. To enable Auto Indexing, see Configure the YML file.
- A DataWorks workspace. See Create a workspace.
Constraints:
- Only Alibaba Cloud Elasticsearch is supported as the sync destination. Self-managed Elasticsearch instances are not supported.
- The HBase cluster, Elasticsearch cluster, and DataWorks workspace must be in the same region and the same time zone. A time zone mismatch causes incorrect timestamps in synchronized time-related data.
Billing
- For Elasticsearch cluster pricing, see Elasticsearch billable items.
- For exclusive resource group pricing, see Billing of exclusive resource groups for Data Integration (subscription).
Step 1: Prepare source data
This tutorial uses a table named student with three column families: name, ID, and gender.
Create the table in HBase Shell:
create 'student', {NAME => 'name'}, {NAME => 'ID'}, {NAME => 'gender'}
Insert test data using put commands. For example:
put 'student', 'row1', 'name:a', 'xiaoming'
Verify the data with scan 'student'.
For more details on loading data, see Use HBase Shell to access.
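To give the sync task something to read from every column family, you can extend the single put above. The following HBase shell commands are a sketch; the row key, qualifiers, and values are illustrative, matching the single-qualifier-per-family layout used in this tutorial:

```
put 'student', 'row1', 'name:a', 'xiaoming'
put 'student', 'row1', 'ID:a', '001'
put 'student', 'row1', 'gender:a', 'male'
scan 'student'
```

The final scan should list one cell per column family for row1.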
Step 2: Purchase and configure an exclusive resource group
An exclusive resource group handles data transfer between your HBase cluster, Elasticsearch cluster, and DataWorks. It must be connected to the VPCs of both clusters.
- Log on to the DataWorks console.
- In the top navigation bar, select your region. In the left navigation pane, click Resource Group.
- On the Exclusive Resource Groups tab, click Create Legacy Resource Group > Data Integration Resource Group.
- On the DataWorks Exclusive Resources (Subscription) page, set Resource Type to Exclusive Resource Group For Data Integration, enter a name, and click Buy Now. For details, see Step 1: Create an exclusive resource group for Data Integration.
- In the Operation column for the new resource group, click Network Settings to attach a VPC. The exclusive resource group must connect to the VPCs of both the HBase and Elasticsearch clusters. Attach the resource group to the VPC, zone, and vSwitch of each cluster. To find the Elasticsearch cluster's VPC details, see View the basic information of an Elasticsearch instance.
  Important: After attaching a VPC, add the vSwitch CIDR block to the private IP address whitelists of both the HBase and Elasticsearch clusters. For the Elasticsearch cluster, see Configure a public or private IP address whitelist for an Elasticsearch cluster. For the exclusive resource group, see Configure an IP address whitelist.
- Click the back icon to return to the Resource Group List page.
- In the Operation column, click Attach Workspace to associate the resource group with your workspace. For details, see Step 2: Associate the exclusive resource group for Data Integration with a workspace.
Step 3: Add data sources
Register both HBase and Elasticsearch as data sources in Data Integration.
- Open Data Integration.
  - Log on to the DataWorks console.
  - In the left navigation pane, click Workspace.
  - In the Actions column for your workspace, choose Shortcuts > Data Integration.
- In the left navigation pane, click Data Source.
- Add the HBase data source.
  - On the Data Source List page, click Add Data Source.
  - In the Add Data Source dialog box, search for and select HBase.
  - In the Add HBase Data Source dialog box, configure the parameters in the Basic Information section. For parameter descriptions, see Configure an HBase data source.
  - In the Connection Configuration section, click Test Connectivity. A Connected status confirms a successful connection.
  - Click Complete.
- Add the Elasticsearch data source using the same steps. For details, see Add an Elasticsearch data source.
Step 4: Configure and run an offline sync task
This tutorial uses the codeless UI on the legacy Data Development (DataStudio) page. To use the code editor instead, see Configure a batch synchronization task using the code editor. For Elasticsearch-specific writer configuration, see Elasticsearch Writer.
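In the code editor, the same task is expressed as a JSON script with a reader step and a writer step. The following is a minimal sketch only, assuming an HBase 1.1-compatible reader; the ZooKeeper quorum, endpoint, credentials, index name, and column names are illustrative placeholders, and the authoritative parameter list is in the HBase data source and Elasticsearch Writer documentation:

```json
{
  "type": "job",
  "steps": [
    {
      "stepType": "hbase11x",
      "category": "reader",
      "name": "Reader",
      "parameter": {
        "mode": "normal",
        "table": "student",
        "encoding": "utf-8",
        "hbaseConfig": { "hbase.zookeeper.quorum": "<zk-host>:2181" },
        "column": [
          { "name": "rowkey", "type": "string" },
          { "name": "name:a", "type": "string" },
          { "name": "ID:a", "type": "string" },
          { "name": "gender:a", "type": "string" }
        ]
      }
    },
    {
      "stepType": "elasticsearch",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "endpoint": "http://<es-host>:9200",
        "accessId": "<username>",
        "accessKey": "<password>",
        "index": "student_info",
        "cleanup": false,
        "column": [
          { "name": "rowkey", "type": "id" },
          { "name": "name", "type": "text" },
          { "name": "ID", "type": "keyword" },
          { "name": "gender", "type": "keyword" }
        ]
      }
    }
  ]
}
```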
- Open Data Development.
  - Log on to the DataWorks console.
  - In the left navigation pane, click Workspace.
  - In the Actions column for your workspace, choose Shortcuts > Data Development.
- Create an offline sync task.
  - On the Data Development tab in the navigation pane, choose Create > Create Workflow and follow the prompts.
  - Right-click the workflow and choose Create Node > Data Integration > Offline Synchronization.
  - In the Create Node dialog box, enter a node name and click Confirm.
- Configure the network and resources.
  - In the Source section, set Source to HBase and Data Source Name to your HBase data source.
  - For Resource Group, select the exclusive resource group you created.
  - In the Destination section, set Destination to Elasticsearch and Data Source Name to your Elasticsearch data source.
- Click Next.
- Configure the task details.
  - In the Source section, select the table to synchronize.
  - Configure the parameters in the Destination section.
  - In the Field Mapping section, map Source Fields to Destination Fields. For details, see Configure a batch synchronization task using the codeless UI.
  - Configure the channel parameters in the Channel Control section.
- Save and run the task.
  - (Optional) Click Scheduling Configuration on the right panel to set up a recurring schedule. See Scheduling configuration. If you configure a schedule, the task runs periodically.
  - Click the Save icon above the node editor.
  - Click the Submit icon to submit the task.
  - To run the task immediately, click the Run icon in the upper-right corner of the node editor. When the operational log shows Shell run successfully!, the task has completed.
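Conceptually, the field mapping you configured turns each HBase row into one Elasticsearch document: the row key becomes the document ID, and each mapped column becomes a field. The sketch below is illustrative only (the actual transformation is performed by the sync task, not by your code) and assumes this tutorial's one-qualifier-per-family layout:

```python
def hbase_row_to_es_doc(row_key, cells):
    """Flatten HBase cells ({'family:qualifier': value}) into a flat
    Elasticsearch document, using the column family as the field name."""
    doc = {"_id": row_key}
    for column, value in cells.items():
        family, _, qualifier = column.partition(":")
        doc[family] = value
    return doc

# One HBase row from Step 1 becomes one Elasticsearch document.
doc = hbase_row_to_es_doc(
    "row1", {"name:a": "xiaoming", "ID:a": "001", "gender:a": "male"}
)
```

With multiple qualifiers per family, you would instead map each `family:qualifier` pair to its own destination field in the Field Mapping section.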
Step 5: Verify the data synchronization result
- Log on to the Kibana console. For instructions, see Log on to the Kibana console.
  Note: This tutorial uses an Elasticsearch V7.10.0 cluster. Steps may differ for other versions. The actual operations in the console prevail.
- In the upper-right corner, click Dev tools.
- On the Console tab, run the following query to view the synchronized data:
  POST /student_info/_search?pretty
  { "query": { "match_all": {} } }
  Note: student_info is the index name that you set for the destination in the offline sync task.
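If the synchronization succeeded, the response lists the HBase rows as documents. The shape is roughly as follows (an illustrative Elasticsearch 7.x response fragment; the document fields depend on your field mapping):

```json
{
  "hits": {
    "total": { "value": 1, "relation": "eq" },
    "hits": [
      {
        "_index": "student_info",
        "_id": "row1",
        "_source": { "name": "xiaoming", "ID": "001", "gender": "male" }
      }
    ]
  }
}
```

An empty hits array usually means the task wrote to a different index name or has not finished running.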