Synchronize data from MaxCompute to Alibaba Cloud ES

You can use Alibaba Cloud Elasticsearch (ES) to retrieve information, run multi-dimensional queries, and perform statistical analysis on large volumes of data in MaxCompute (ODPS). This topic describes how to use the Data Integration service of DataWorks to synchronize large volumes of MaxCompute data to an Alibaba Cloud ES instance in offline mode. This process typically takes only a few minutes.

Background information

DataWorks is an end-to-end big data development and governance platform based on big data compute engines. DataWorks provides features such as data development, task scheduling, and data management. You can create synchronization tasks in DataWorks to rapidly synchronize data from various data sources to Alibaba Cloud Elasticsearch.

The following types of data sources are supported:
- Alibaba Cloud databases: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB for MongoDB, and ApsaraDB for HBase
- Alibaba Cloud PolarDB for Xscale (PolarDB-X) (formerly DRDS)
- Alibaba Cloud MaxCompute
- Alibaba Cloud Object Storage Service (OSS)
- Alibaba Cloud Tablestore
- Self-managed databases: HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, MongoDB, and HBase

The following synchronization scenarios are supported:
- Synchronize big data from a database or table to Alibaba Cloud Elasticsearch in offline mode. For more information, see Create a batch synchronization task to synchronize all data in a database to Elasticsearch.
- Synchronize full and incremental big data to Alibaba Cloud Elasticsearch in real time. For more information, see Create a real-time synchronization task to synchronize data to Elasticsearch.

Prerequisites

A MaxCompute project is created. For more information, see Create a MaxCompute project.
An Alibaba Cloud Elasticsearch cluster is created, and the Auto Indexing feature is enabled for the cluster. For more information, see Create an Alibaba Cloud Elasticsearch cluster and Configure the YML file.
A DataWorks workspace is created. For more information, see Create a workspace.

Note

Data synchronization is supported only for Alibaba Cloud ES instances. Self-managed Elasticsearch clusters are not supported.
The MaxCompute project, ES instance, and DataWorks workspace must be in the same region.
The ES instance, MaxCompute project, and DataWorks workspace must be in the same time zone. Otherwise, a time zone difference may occur between the source and destination data after time-related data is synchronized.
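
The effect of a time zone mismatch can be illustrated with a short Python sketch. It assumes, purely for illustration, that one side interprets a wall-clock value as UTC+8 and the other side interprets the same value as UTC. The offsets and the sample timestamp are assumptions and are not taken from an actual synchronization task.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offsets: the source side runs at UTC+8, the destination at UTC.
SOURCE_TZ = timezone(timedelta(hours=8))
DEST_TZ = timezone.utc


def interpretation_gap(wall_clock: str) -> timedelta:
    """Return how far apart the two sides place the same wall-clock string."""
    naive = datetime.strptime(wall_clock, "%Y-%m-%d %H:%M:%S")
    as_source = naive.replace(tzinfo=SOURCE_TZ)
    as_dest = naive.replace(tzinfo=DEST_TZ)
    return as_dest - as_source


gap = interpretation_gap("2023-10-31 16:52:35")
print(gap)  # 8:00:00
```

If both sides use the same time zone, the gap is zero and time-related fields keep their meaning after synchronization.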

Billing

For information about the billing of Alibaba Cloud Elasticsearch clusters, see Elasticsearch billable items.
For information about the billing of exclusive resource groups for Data Integration, see Billing of exclusive resource groups for Data Integration (subscription).

Procedure

Step 1: Prepare source data

Create a table in MaxCompute and import data into the table. For more information, see Create tables and Import data to tables.

The following table schema and data are used in this topic:

Table schema
A portion of the table data

Step 2: Purchase and configure an exclusive resource group

Purchase an exclusive resource group for Data Integration. Then, attach a virtual private cloud (VPC) and a workspace to the resource group. An exclusive resource group ensures fast and stable data transmission.

Log on to the DataWorks console.
In the top navigation bar, select a region. In the navigation pane on the left, click Resource Group.
On the Exclusive Resource Groups tab, click Create Legacy Resource Group > Data Integration Resource Group.
On the DataWorks Exclusive Resource (Subscription) purchase page, set Exclusive Resource Type to Exclusive Resource For Data Integration, enter a name for the resource group, and then click Buy Now to purchase the exclusive resource group.
For more information, see Step 1: Create an exclusive resource group for Data Integration.
Find the created exclusive resource group and in the Actions column, click Network Settings to attach a virtual private cloud (VPC). For more information, see Attach a VPC.
Note
In this example, an exclusive resource group for Data Integration is used to synchronize data over a VPC. For information about how to use an exclusive resource group for Data Integration to synchronize data over the Internet, see Configure an IP address whitelist.
The exclusive resource group can synchronize data only if it is connected to the VPC where the Elasticsearch cluster resides. Therefore, you must associate the exclusive resource group with the VPC, zone, and vSwitch of the Elasticsearch cluster. To view the VPC, zone, and vSwitch of the Elasticsearch cluster, see View the basic information of a cluster.
Important
After you associate a VPC, you must add the vSwitch CIDR Block of the VPC to the VPC internal-facing access whitelist of the Elasticsearch instance. For more information, see Configure a public or internal-facing access whitelist for an Elasticsearch instance.
In the upper-left corner of the page, click the back icon to return to the Resource Group List page.
In the Operation column of the created exclusive resource group, click Attach Workspace to attach the target workspace to the resource group.
For more information, see Step 2: Associate the exclusive resource group for Data Integration with a workspace.
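
Whether an address is covered by a whitelisted CIDR block can be checked locally before you edit the whitelist. The following Python sketch uses the standard ipaddress module. The CIDR block and the addresses are hypothetical placeholders, not values from a real cluster.

```python
import ipaddress


def in_whitelist(address: str, cidr_blocks: list[str]) -> bool:
    """Return True if the address falls inside any whitelisted CIDR block."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(block) for block in cidr_blocks)


# Hypothetical vSwitch CIDR block; replace it with the value of your vSwitch.
whitelist = ["192.168.0.0/24"]
print(in_whitelist("192.168.0.15", whitelist))  # True
print(in_whitelist("192.168.1.15", whitelist))  # False
```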

Step 3: Add data sources

Add MaxCompute and Elasticsearch as data sources to the Data Integration service of DataWorks.

Go to the Data Integration page in DataWorks.
1. Log on to the DataWorks console.
2. In the navigation pane on the left, click Workspaces.
3. In the Operation column of the target workspace, choose Quick Access > Data Integration.
In the navigation pane on the left, click Data Source.
Add a MaxCompute data source.
1. On the Data Source List page, click Add Data Source.
2. On the Add Data Source page, search for and select MaxCompute.
3. In the Add MaxCompute Data Source dialog box, configure the data source parameters in the Basic Information section.
  For more information, see Add a MaxCompute data source.
4. In the Connection Configuration section, click Test Connectivity. If the connectivity status is Connected, the connection is successful.
5. Click Complete.
Add an Elasticsearch data source in the same way. For more information, see Add an Elasticsearch data source.

Step 4: Configure and run a data synchronization task

A data synchronization task runs on the exclusive resource group. The resource group reads data from the source that you added in Data Integration and writes the data to Elasticsearch.

Note

You can use the codeless UI or code editor to configure the batch synchronization task. In this example, the codeless UI is used. For information about how to use the code editor to configure the batch synchronization task, see Configure a batch synchronization task using the code editor and Elasticsearch Writer.
This topic uses the legacy Data Development (DataStudio) page as an example to create a batch synchronization task.
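
If you use the code editor instead, the task is described by a JSON script. The following Python sketch assembles such a script as a dictionary and prints it as JSON. It is an illustration under assumptions: the key names follow the general layout of a reader step and a writer step and should be verified against the Elasticsearch Writer documentation, and the data source names, table name, and field types are hypothetical. Only the index name odps_index and the field names category, brand, and trans_num come from this topic.

```python
import json


def build_sync_script(source_datasource, table, dest_datasource, index, columns):
    """Assemble a script-mode job definition as a plain dict.

    `columns` is a list of (name, es_type) pairs. All key names below are
    assumptions to verify against the Elasticsearch Writer documentation.
    """
    reader = {
        "stepType": "odps",
        "name": "Reader",
        "category": "reader",
        "parameter": {
            "datasource": source_datasource,
            "table": table,
            "column": [name for name, _ in columns],
        },
    }
    writer = {
        "stepType": "elasticsearch",
        "name": "Writer",
        "category": "writer",
        "parameter": {
            "datasource": dest_datasource,
            "index": index,
            "column": [{"name": n, "type": t} for n, t in columns],
        },
    }
    return {
        "type": "job",
        "version": "2.0",
        "steps": [reader, writer],
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


script = build_sync_script(
    source_datasource="my_odps_source",   # hypothetical data source name
    table="my_source_table",              # hypothetical table name
    dest_datasource="my_es_destination",  # hypothetical data source name
    index="odps_index",
    # Field types are assumptions; use the types that match your table.
    columns=[("category", "keyword"), ("brand", "keyword"), ("trans_num", "long")],
)
print(json.dumps(script, indent=2))
```

A partitioned source table or tuning options such as the batch size would need additional parameters that this sketch omits.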

Go to the Data Development page in DataWorks.
1. Log on to the DataWorks console.
2. In the navigation pane on the left, click Workspaces.
3. In the Operation column of the target workspace, choose Quick Access > Data Development.
Create a batch synchronization task.
1. In the left-side navigation pane, choose Create > Create Workflow to create a workflow.
2. Right-click the name of the newly created workflow and choose Create Node > Offline synchronization.
3. In the Create Node dialog box, configure the Name parameter and click Confirm.
Configure the network and resource group.
1. In the Data Source section, set Source to MaxCompute (ODPS) and Data Source Name to the name of the source data source.
2. In the My Resource Group section, select the exclusive resource group.
3. In the Data Destination section, set Destination to Elasticsearch and Data Source Name to the name of the destination data source.
Click Next.
Configure the task.
1. In the Data Source section, select the source table.
2. In the Data Destination section, configure the parameters.
3. In the Field Mapping section, set the mappings between Source Field and Destination Field.
4. In the Channel Control section, configure the channel parameters.
For more information, see Configure a batch synchronization task using the codeless UI.
Run the task.
1. (Optional) Configure scheduling properties for the task. In the right-side navigation pane, click Properties. On the Properties tab, configure the parameters as needed. For more information about the parameters, see Scheduling configuration.
2. In the upper-right corner of the node configuration tab, click the Save icon to save the task.
3. In the upper-right corner of the node configuration tab, click the Submit icon to submit the task.
  If you configured scheduling properties, the task runs automatically on a schedule. You can also click the Run icon in the upper-right corner of the node configuration tab to run the task immediately.
  If the log contains the message "Shell run successfully!", the task ran successfully. The following code shows a sample log:
```
2023-10-31 16:52:35 INFO Exit code of the Shell command 0
2023-10-31 16:52:35 INFO --- Invocation of Shell command completed ---
2023-10-31 16:52:35 INFO Shell run successfully!
2023-10-31 16:52:35 INFO Current task status: FINISH
2023-10-31 16:52:35 INFO Cost time is: 33.106s
```
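
If you save the log to a file, the same check can be scripted. The following Python sketch looks for the success message and extracts the reported run time. It assumes that the log lines have the same wording as the sample above. The function name is illustrative.

```python
import re

SUCCESS_MARKER = "Shell run successfully!"


def summarize_log(log_text: str) -> dict:
    """Report whether the success marker is present and the run time in seconds."""
    succeeded = SUCCESS_MARKER in log_text
    match = re.search(r"Cost time is: ([\d.]+)s", log_text)
    seconds = float(match.group(1)) if match else None
    return {"succeeded": succeeded, "cost_seconds": seconds}


sample = """\
2023-10-31 16:52:35 INFO Exit code of the Shell command 0
2023-10-31 16:52:35 INFO --- Invocation of Shell command completed ---
2023-10-31 16:52:35 INFO Shell run successfully!
2023-10-31 16:52:35 INFO Current task status: FINISH
2023-10-31 16:52:35 INFO Cost time is: 33.106s
"""
print(summarize_log(sample))  # {'succeeded': True, 'cost_seconds': 33.106}
```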

Step 5: Verify the data synchronization result

In the Kibana console, you can view the synchronized data and query the data based on specified conditions.

Log on to the Kibana console of the target Alibaba Cloud ES instance.
For more information, see Log on to the Kibana console.
In the upper-left corner of the Kibana page, click the icon and select Dev Tools.
In the Console, run the following command to view the synchronized data.
```
POST /odps_index/_search?pretty
{
  "query": { "match_all": {} }
}
```
Note
odps_index is the value of the index field that you set in the data synchronization script.
If the data is synchronized, the response lists the synchronized documents under hits.hits.

Run the following command to return only the category and brand fields of the documents.

```
POST /odps_index/_search?pretty
{
  "query": { "match_all": {} },
  "_source": ["category", "brand"]
}
```

Run the following command to search for documents where the category is fresh produce.

```
POST /odps_index/_search?pretty
{
  "query": { "match": { "category": "fresh produce" } }
}
```

Run the following command to sort the documents by the trans_num field.
```
POST /odps_index/_search?pretty
{
  "query": { "match_all": {} },
  "sort": { "trans_num": { "order": "desc" } }
}
```
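
The same searches can also be prepared outside Kibana. The following Python sketch builds the HTTP requests with the standard library and prints their URLs without sending them. The endpoint, username, and password are hypothetical placeholders, and the sketch assumes that the instance accepts HTTP basic authentication and that the client address is in the access whitelist of the instance.

```python
import base64
import json
import urllib.request

# Request bodies that mirror the console commands in this step.
QUERIES = {
    "all": {"query": {"match_all": {}}},
    "fields": {"query": {"match_all": {}}, "_source": ["category", "brand"]},
    "category": {"query": {"match": {"category": "fresh produce"}}},
    "sorted": {"query": {"match_all": {}}, "sort": {"trans_num": {"order": "desc"}}},
}


def build_search_request(endpoint, index, body, user, password):
    """Build a POST request to the _search API without sending it."""
    url = f"{endpoint.rstrip('/')}/{index}/_search?pretty"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


for name, body in QUERIES.items():
    # Hypothetical endpoint and credentials; replace them with your own values.
    req = build_search_request(
        "http://es.example.internal:9200", "odps_index", body, "elastic", "secret"
    )
    print(name, req.get_method(), req.full_url)
    # To send the request: urllib.request.urlopen(req).read()
```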
For more information about commands and access methods, see the Elastic.co Help Center.