Elasticsearch: Use DataWorks to synchronize data from Hadoop to Alibaba Cloud ES

Last Updated: Aug 14, 2025

If you experience high latency when you perform interactive big data analytics and queries on Hadoop, you can synchronize the data to Alibaba Cloud Elasticsearch for faster queries and analysis. Elasticsearch can respond to multiple types of queries, especially ad hoc queries, in seconds. This topic describes how to use the Data Integration service of DataWorks to synchronize large amounts of data from Hadoop to Alibaba Cloud ES.

Background information

DataWorks is an end-to-end big data development and governance platform based on big data compute engines. DataWorks provides features such as data development, task scheduling, and data management. You can create synchronization tasks in DataWorks to rapidly synchronize data from various data sources to Alibaba Cloud Elasticsearch.

  • The following types of data sources are supported:

    • Alibaba Cloud databases: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for MongoDB, and ApsaraDB for HBase

    • Alibaba Cloud PolarDB for Xscale (PolarDB-X) (formerly DRDS)

    • Alibaba Cloud MaxCompute

    • Alibaba Cloud Object Storage Service (OSS)

    • Alibaba Cloud Tablestore

    • Self-managed databases: HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, MongoDB, and HBase

  • Multiple synchronization scenarios are supported, including batch synchronization, which this topic uses.

Prerequisites

  • A Hadoop cluster exists and contains data.

  • The Hadoop cluster, ES instance, and DataWorks workspace must be in the same region.

  • The Hadoop cluster, ES instance, and DataWorks workspace must be in the same time zone. Otherwise, time-related data may be offset by the time zone difference between the source and destination after synchronization.

Billing

Procedure

Step 1: Purchase and create an exclusive resource group

Purchase an exclusive resource group for Data Integration and associate the resource group with a VPC and a workspace. An exclusive resource group ensures fast and stable data transmission.

  1. Log on to the DataWorks console.

  2. In the top navigation bar, select a region. In the navigation pane on the left, click Resource Group.

  3. On the Exclusive Resource Groups tab, choose Create Old Version Resource Group > Data Integration Resource Group.

  4. On the DataWorks Exclusive Resources (Subscription) purchase page, set Exclusive Resource Type to Exclusive Data Integration Resources, enter a resource group name, and click Buy Now.

    For more information, see Step 1: Create an exclusive resource group for Data Integration.

  5. In the Actions column of the exclusive resource group that you created, click Network Settings to attach a virtual private cloud (VPC) to the resource group. For more information, see Attach a VPC.

    Note

    In this example, an exclusive resource group for Data Integration is used to synchronize data over a VPC. For information about how to use an exclusive resource group for Data Integration to synchronize data over the Internet, see Configure an IP address whitelist.

    The exclusive resource group must be connected to both the VPC where the Hadoop cluster resides and the VPC where the Elasticsearch cluster resides before it can synchronize data between them. Therefore, associate the exclusive resource group with the VPC, zone, and vSwitch of each cluster. For information about how to view the VPC, zone, and vSwitch of the Elasticsearch cluster, see View the basic information of a cluster.

    Important

    After you attach a VPC, you must add the vSwitch CIDR Block of the VPC to the internal-facing access whitelists of the Hadoop cluster and the ES instance. For more information, see Configure a public or private access whitelist for an ES instance.

  6. Click the back icon in the upper-left corner of the page to return to the Resource Groups page.

  7. In the Actions column of the exclusive resource group that you created, click Attach Workspace to attach the resource group to a target workspace.

    For more information, see Step 2: Associate the exclusive resource group for Data Integration with a workspace.

Step 2: Add data sources

  1. Go to the Data Integration page.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Workspace.

    3. Find the workspace and choose Shortcuts > Data Integration in the Actions column.

  2. In the navigation pane on the left, click Data Source.

  3. Add a Hadoop Distributed File System (HDFS) data source.

    1. On the Data Sources page, click Add Data Source.

    2. In the Add Data Source dialog box, search for and select HDFS.

    3. On the Add HDFS Data Source page, configure the data source parameters.

      For more information, see Add an HDFS data source.

    4. Click Test Connectivity. A status of Connected indicates that the connection is successful.

    5. Click Complete.

  4. Add an Elasticsearch data source in the same way. For more information, see Add an Elasticsearch data source.
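For reference, the key settings of an HDFS data source resemble the following sketch. The NameNode address and high-availability parameters here are placeholders for illustration, not values from this topic; use the address of your own Hadoop cluster. The hadoopConfig block is typically needed only if your cluster uses NameNode high availability:

```json
{
  "defaultFS": "hdfs://192.168.0.100:9000",
  "hadoopConfig": {
    "dfs.nameservices": "emr-cluster",
    "dfs.ha.namenodes.emr-cluster": "nn1,nn2"
  }
}
```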

Step 3: Configure and run a batch data synchronization task

A batch synchronization task runs using the exclusive resource group. The resource group retrieves data from the source and writes the data to the ES instance.

  1. Go to the DataStudio page of DataWorks.

    1. Log on to the DataWorks console.

    2. In the left-side navigation pane, click Workspace.

    3. Find the workspace and choose Shortcuts > Data Development in the Actions column.

  2. Create a batch synchronization task.

    1. In the left-side navigation pane, choose Create > Create Workflow to create a workflow.

    2. Right-click the name of the newly created workflow and choose Create Node > Offline synchronization.

    3. In the Create Node dialog box, configure the Name parameter and click Confirm.

  3. Configure the network and resources.

    1. In the Data Source section, set Data Source to HDFS and Data Source Name to the name of the HDFS data source that you added in Step 2.

    2. In the My Resource Group section, select an exclusive resource group.

    3. In the Data Destination section, set Data Destination to ES and Data Source Name to the name of the Elasticsearch data source that you added in Step 2.

  4. Click Next.

  5. Configure the task.

    1. In the Source section, select the table whose data you want to synchronize.

    2. In the Data Destination section, configure the parameters.

    3. In the Field Mapping section, map the Source Fields to the Destination Fields.

    4. In the Channel Control section, configure the channel parameters.

    For more information, see Configure an offline sync task using the codeless UI.

  6. Run the task.

    1. (Optional) Configure scheduling properties for the task. On the right side of the page, click Scheduling Configuration and configure the scheduling parameters as required. For more information about the parameters, see Scheduling Configuration.

    2. In the upper-left corner of the node configuration tab, click the Save icon to save the task.

    3. In the upper-left corner of the node configuration tab, click the Submit icon to submit the task.

      If you configured scheduling properties for the task, the task runs automatically at the scheduled intervals. You can also click the Run icon in the upper-left corner of the node configuration tab to run the task immediately.

      If the Shell run successfully! message appears in the run log, the task ran successfully.
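Behind the codeless UI, DataWorks stores the task as a script with a reader step and a writer step. The following is a rough sketch of what such a script can look like for an HDFS reader and an Elasticsearch writer; the path, NameNode address, endpoint, credentials, index name, and columns are illustrative assumptions and must be replaced with values from your own environment:

```json
{
  "type": "job",
  "steps": [
    {
      "stepType": "hdfs",
      "category": "reader",
      "name": "Reader",
      "parameter": {
        "path": "/user/hive/warehouse/esdoc_good_sale",
        "defaultFS": "hdfs://192.168.0.100:9000",
        "fileType": "text",
        "fieldDelimiter": ",",
        "column": [
          { "index": 0, "type": "string" },
          { "index": 1, "type": "long" }
        ]
      }
    },
    {
      "stepType": "elasticsearch",
      "category": "writer",
      "name": "Writer",
      "parameter": {
        "endpoint": "http://es-cn-example.elasticsearch.aliyuncs.com:9200",
        "accessId": "elastic",
        "accessKey": "<your-password>",
        "index": "hive_esdoc_good_sale",
        "column": [
          { "name": "category", "type": "text" },
          { "name": "sales", "type": "long" }
        ]
      }
    }
  ],
  "order": {
    "hops": [ { "from": "Reader", "to": "Writer" } ]
  }
}
```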

Step 4: Verify the data synchronization result

  1. Log on to the Kibana console of the destination Alibaba Cloud ES instance.

    For more information, see Log on to the Kibana console.

  2. In the navigation pane on the left, click Dev Tools.

  3. In the Console, run the following command to view the synchronized data:

    POST /hive_esdoc_good_sale/_search?pretty
    {
      "query": { "match_all": {} }
    }
    Note

    hive_esdoc_good_sale is the value that you set for the index field in the data synchronization script.

    If the data is synchronized, the returned result contains the documents from the source table.
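As an additional check, you can compare the number of documents in the index with the number of rows in the source table. For example, run the following command in the Kibana console; it assumes the hive_esdoc_good_sale index used in this topic:

```
GET /hive_esdoc_good_sale/_count
```

If the count field in the response matches the row count of the source table, all rows were synchronized.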