If you experience high latency when you perform interactive big data analytics and queries on Hadoop, you can synchronize the data to Alibaba Cloud Elasticsearch for faster queries and analysis. Elasticsearch can respond to multiple types of queries, especially ad hoc queries, in seconds. This topic describes how to use the Data Integration service of DataWorks to synchronize large amounts of data from Hadoop to Alibaba Cloud Elasticsearch.
Background information
DataWorks is an end-to-end big data development and governance platform based on big data compute engines. DataWorks provides features such as data development, task scheduling, and data management. You can create synchronization tasks in DataWorks to rapidly synchronize data from various data sources to Alibaba Cloud Elasticsearch.
The following types of data sources are supported:
Alibaba Cloud databases: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for MongoDB, and ApsaraDB for HBase
Alibaba Cloud PolarDB for Xscale (PolarDB-X) (formerly DRDS)
Alibaba Cloud MaxCompute
Alibaba Cloud Object Storage Service (OSS)
Alibaba Cloud Tablestore
Self-managed data sources: HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, MongoDB, and HBase
The following synchronization scenarios are supported:
Synchronize big data from a database or table to Alibaba Cloud Elasticsearch in offline mode. For more information, see Create a batch synchronization task to synchronize all data in a database to Elasticsearch.
Synchronize full and incremental big data to Alibaba Cloud Elasticsearch in real time. For more information, see Create a real-time synchronization task to synchronize data to Elasticsearch.
Prerequisites
An Alibaba Cloud Elasticsearch cluster is created, and the Auto Indexing feature is enabled for the cluster. A quick way to verify this setting is shown after this list. For more information, see Create an Alibaba Cloud Elasticsearch cluster and Configure the YML file.
Note: You can synchronize data only to an Alibaba Cloud Elasticsearch cluster. Self-managed Elasticsearch clusters are not supported as destinations.
A DataWorks workspace is created. For more information, see Create a workspace.
A Hadoop cluster exists and contains data.
The Hadoop cluster, Elasticsearch cluster, and DataWorks workspace must be in the same region.
The Hadoop cluster, Elasticsearch cluster, and DataWorks workspace must use the same time zone. Otherwise, time-related data in the source and destination may differ by the time zone offset.
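For reference, automatic index creation corresponds to the Elasticsearch cluster setting action.auto_create_index, which you configure on the YML Configuration page of the cluster in the console. If you want to confirm the effective value, you can run the following check in the Kibana console (assuming your account is allowed to call the cluster settings API):
GET _cluster/settings?include_defaults=true&filter_path=*.action.auto_create_index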
Billing
For information about the billing of Alibaba Cloud Elasticsearch clusters, see Elasticsearch billable items.
For information about the billing of exclusive resource groups for Data Integration, see Billing of exclusive resource groups for Data Integration (subscription).
Procedure
Step 1: Purchase and create an exclusive resource group
Purchase an exclusive resource group for Data Integration and associate the resource group with a VPC and a workspace. An exclusive resource group ensures fast and stable data transmission.
Log on to the DataWorks console.
In the top navigation bar, select a region. In the navigation pane on the left, click Resource Group.
On the Exclusive Resource Groups tab, click the button for purchasing an exclusive resource group.
On the DataWorks Exclusive Resources (Subscription) purchase page, set Exclusive Resource Type to Exclusive Data Integration Resources, enter a resource group name, and click Buy Now.
For more information, see Step 1: Create an exclusive resource group for Data Integration.
In the Actions column of the exclusive resource group that you created, click Network Settings to attach a virtual private cloud (VPC) to the resource group. For more information, see Attach a VPC.
Note: In this example, an exclusive resource group for Data Integration is used to synchronize data over a VPC. For information about how to use an exclusive resource group for Data Integration to synchronize data over the Internet, see Configure an IP address whitelist.
The exclusive resource group can synchronize the data only if it is connected to both the VPC in which the Hadoop cluster resides and the VPC in which the Elasticsearch cluster resides. Therefore, you must associate the exclusive resource group with the VPC, zone, and vSwitch of the Hadoop cluster and those of the Elasticsearch cluster. For information about how to view the VPC, zone, and vSwitch of the Elasticsearch cluster, see View the basic information of a cluster.
Important: After you attach a VPC, you must add the CIDR block of the vSwitch (for example, 192.168.0.0/24) to the private network access whitelists of both the Hadoop cluster and the Elasticsearch cluster. For more information, see Configure a public or private access whitelist for an Elasticsearch cluster.
Click the back icon in the upper-left corner of the page to return to the Resource Groups page.
In the Actions column of the exclusive resource group that you created, click Attach Workspace to attach the resource group to a target workspace.
For more information, see Step 2: Associate the exclusive resource group for Data Integration with a workspace.
Step 2: Add data sources
Go to the Data Integration page.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find the workspace, and in the Actions column, choose the shortcut that opens Data Integration.
In the navigation pane on the left, click Data Source.
Add a Hadoop Distributed File System (HDFS) data source.
On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, search for and select HDFS.
On the Add HDFS Data Source page, configure the data source parameters. A hypothetical example of the key setting follows these steps.
For more information, see Add an HDFS data source.
Click Test Connectivity. A status of Connected indicates that the connection is successful.
Click Complete.
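The exact fields in the Add HDFS Data Source dialog box depend on your DataWorks version, but the key value is the address of the HDFS NameNode. A hypothetical example of the setting (the IP address is a placeholder, and the NameNode RPC port is commonly 8020 or 9000):
DefaultFS: hdfs://192.168.0.100:8020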
Add an Elasticsearch data source in the same way. For more information, see Add an Elasticsearch data source.
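For the Elasticsearch data source, you typically provide the endpoint of the cluster together with the username and password of the cluster's elastic account. A hypothetical example (the endpoint below only illustrates the usual Alibaba Cloud Elasticsearch endpoint format):
Endpoint: http://es-cn-xxxxxxxxxxxxxxxxx.elasticsearch.aliyuncs.com:9200
Username: elastic
Password: <the password that you set when you created the cluster>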
Step 3: Configure and run a batch data synchronization task
A batch synchronization task runs on the exclusive resource group: the resource group retrieves data from the source and writes the data to the Elasticsearch cluster.
You can use the codeless UI or code editor to configure the batch synchronization task. In this example, the codeless UI is used. For information about how to use the code editor to configure the batch synchronization task, see Configure a batch synchronization task using the code editor and Elasticsearch Writer.
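For reference, the following is a minimal, hypothetical sketch of what a script-mode (code editor) configuration for this scenario could look like. The data source names, the HDFS path, the field delimiter, and the column definitions are placeholders that you must replace with your own values; the index name hive_esdoc_good_sale matches the index queried in Step 4. See Elasticsearch Writer for the authoritative parameter reference.
{
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "name": "Reader",
      "category": "reader",
      "stepType": "hdfs",
      "parameter": {
        "datasource": "my_hdfs_source",
        "path": "/user/hive/warehouse/hive_doc_good_sale/*",
        "fileType": "text",
        "fieldDelimiter": ",",
        "encoding": "UTF-8",
        "column": [
          { "index": 0, "type": "string" },
          { "index": 1, "type": "long" }
        ]
      }
    },
    {
      "name": "Writer",
      "category": "writer",
      "stepType": "elasticsearch",
      "parameter": {
        "datasource": "my_es_destination",
        "index": "hive_esdoc_good_sale",
        "cleanup": true,
        "batchSize": 1000,
        "column": [
          { "name": "category", "type": "text" },
          { "name": "sale_count", "type": "long" }
        ]
      }
    }
  ],
  "order": {
    "hops": [ { "from": "Reader", "to": "Writer" } ]
  },
  "setting": {
    "errorLimit": { "record": "0" },
    "speed": { "throttle": false, "concurrent": 2 }
  }
}
In this sketch, setting cleanup to true deletes and re-creates the destination index before data is written, which is convenient for repeated test runs. Set it to false if you want to append to an existing index.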
This topic uses the legacy Data Development (DataStudio) page as an example to demonstrate how to create a batch synchronization task.
Go to the DataStudio page of DataWorks.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find the workspace, and in the Actions column, choose the shortcut that opens DataStudio.
Create a batch synchronization task.
In the left-side navigation pane, create a workflow.
Right-click the name of the newly created workflow and choose the option for creating a batch synchronization node.
In the Create Node dialog box, configure the Name parameter and click Confirm.
Configure the network and resources.
In the Data Source section, set Data Source to HDFS and Data Source Name to the name of the HDFS data source that you added.
In the My Resource Group section, select an exclusive resource group.
In the Data Destination section, set Data Destination to Elasticsearch and Data Source Name to the name of the Elasticsearch data source that you added.
Click Next.
Configure the task.
In the Source section, select the table whose data you want to synchronize.
In the Data Destination section, configure the parameters.
In the Field Mapping section, map the Source Fields to the Destination Fields.
In the Channel Control section, configure the channel parameters.
For more information, see Configure an offline sync task using the codeless UI.
Run the task.
(Optional) Configure scheduling properties for the task. On the right side of the page, click Scheduling Configuration and configure the scheduling parameters as required. For more information about the parameters, see Scheduling Configuration.
In the upper-left corner of the node configuration tab, click the Save icon to save the task.
In the upper-left corner of the node configuration tab, click the Submit icon to submit the task.
If you configured scheduling properties for the task, the task runs automatically at the scheduled intervals. You can also click the Run icon in the upper-left corner of the node configuration tab to run the task immediately.
If the "Shell run successfully!" message appears in the run log, the task ran successfully.
Step 4: Verify the data synchronization result
Log on to the Kibana console of the destination Alibaba Cloud Elasticsearch cluster.
For more information, see Log on to the Kibana console.
In the navigation pane on the left, click Dev Tools.
In the Console, run the following command to view the synchronized data:
POST /hive_esdoc_good_sale/_search?pretty
{
  "query": { "match_all": {} }
}
Note: hive_esdoc_good_sale is the value that you set for the index field in the data synchronization script. If the data is synchronized, the synchronized documents are returned in the response.
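As an additional sanity check that is not part of the original procedure, you can also query the document count of the index and compare it with the number of rows in the source table:
GET /hive_esdoc_good_sale/_count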
