Alibaba Cloud provides a variety of cloud storage and database services. To search for and analyze the data stored in these services, you can use the Data Integration service of DataWorks to synchronize the data to an Alibaba Cloud Elasticsearch cluster. The Data Integration service can collect offline data at a minimum interval of 5 minutes. This topic describes how to synchronize data from a MaxCompute project to an Alibaba Cloud Elasticsearch cluster.

Background information

You can synchronize offline data to Alibaba Cloud Elasticsearch from the following data sources:
  • Alibaba Cloud databases: ApsaraDB RDS for MySQL, ApsaraDB RDS for PostgreSQL, ApsaraDB RDS for SQL Server, ApsaraDB RDS for PPAS, ApsaraDB for MongoDB, and ApsaraDB for HBase
  • Alibaba Cloud Distributed Relational Database Service (DRDS)
  • Alibaba Cloud MaxCompute
  • Alibaba Cloud Object Storage Service (OSS)
  • Alibaba Cloud Tablestore
  • Self-managed data sources: HDFS, Oracle, FTP, Db2, MySQL, PostgreSQL, SQL Server, PPAS, MongoDB, and HBase

Procedure

  1. Preparations
    Create a DataWorks workspace, activate MaxCompute, prepare a MaxCompute data source, and create an Alibaba Cloud Elasticsearch cluster.
  2. Step 1: Purchase and create an exclusive resource group
    Purchase and create an exclusive resource group for Data Integration. Bind the exclusive resource group to a virtual private cloud (VPC) and the created workspace. Exclusive resource groups can be used to transmit data in a fast and stable manner.
  3. Step 2: Add data sources
    Connect the MaxCompute and Elasticsearch data sources to the Data Integration service of DataWorks.
  4. Step 3: Create and run a data synchronization node
    Use the codeless user interface (UI) to create and configure a node that synchronizes data from the MaxCompute data source to the Elasticsearch cluster. When you configure the node, select the exclusive resource group that you created. The node runs on that resource group and writes data to the Elasticsearch cluster.
  5. Step 4: View the synchronized data
    In the Kibana console, view the synchronized data and search for data based on specific conditions.

Preparations

  1. Create a DataWorks workspace. Select MaxCompute as the compute engine.
    For more information, see Create a MaxCompute project.
  2. Create a table in MaxCompute and import data into the table.
    For more information, see Create tables and Import data to tables.
    The following figures show the table schema and a part of the table data.
    Figure 1. Table schema
    Figure 2. Table data
    Note The data is provided only for testing. In real-world scenarios, you can migrate data from Hadoop to MaxCompute and then synchronize the data to your Elasticsearch cluster. For more information, see Migrate data from Hadoop to MaxCompute.
  3. Create an Elasticsearch cluster and enable the Auto Indexing feature for the cluster.
    For more information, see Create an Alibaba Cloud Elasticsearch cluster and Configure the YML file. The Elasticsearch cluster must reside in the same region as the DataWorks workspace that you created.
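    The Auto Indexing feature corresponds to the action.auto_create_index setting in the YML configuration of the cluster. After you enable the feature, you can verify the setting from Kibana. The following request is a simple check; the filter_path parameter only trims the response for readability:
    GET _cluster/settings?include_defaults=true&filter_path=**.auto_create_index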

Step 1: Purchase and create an exclusive resource group

  1. Log on to the DataWorks console.
  2. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Groups.
  3. Purchase exclusive resources for Data Integration. For more information, see Purchase exclusive resources for Data Integration.
    Notice The exclusive resources for Data Integration must reside in the same region as the DataWorks workspace that you created.
  4. Create an exclusive resource group for Data Integration. For more information, see Create an exclusive resource group for Data Integration.
    The following figure shows the configuration used in this example. Resource Group Type is set to Exclusive Resource Groups for Data Integration.
  5. Find the created exclusive resource group and click Network Settings in the Actions column. The VPC Binding tab appears. On the VPC Binding tab, click Add Binding to bind the exclusive resource group to a VPC. For more information, see Configure network settings.
    Exclusive resources are deployed in the VPC where DataWorks resides. You can use DataWorks to synchronize data from the MaxCompute project to the Elasticsearch cluster only after DataWorks connects to the VPCs where the project and the cluster reside. Therefore, when you bind the exclusive resource group to a VPC, select the VPC and vSwitch to which the Elasticsearch cluster belongs.
  6. Click Change Workspace in the Actions column that corresponds to the exclusive resource group to bind it to the DataWorks workspace that you created. For more information, see Associate an exclusive resource group with a workspace.

Step 2: Add data sources

  1. Go to the Data Integration page.
    1. In the left-side navigation pane of the DataWorks console, click Workspaces.
    2. Find the workspace you created and click Data Integration in the Actions column.
  2. In the left-side navigation pane of the Data Integration page, choose Data Source > Data Sources.
  3. On the Data Source page, click Add data source in the upper-right corner.
  4. In the Big Data Storage section of the Add data source dialog box, click MaxCompute. In the Add MaxCompute data source dialog box, configure the parameters.
    • ODPS Endpoint: The endpoint of MaxCompute, which varies by region. For more information, see Endpoints.
    • ODPS project name: To obtain the project name, switch back to the overview page of the DataWorks console. In the left-side navigation pane, choose Compute Engines > MaxCompute.
    • AccessKey ID: The AccessKey ID of your Alibaba Cloud account. To obtain the AccessKey ID, move the pointer over your profile picture and click AccessKey Management.
    • AccessKey Secret: The AccessKey secret of your Alibaba Cloud account. To obtain the AccessKey secret, move the pointer over your profile picture and click AccessKey Management.
    Note Configure the parameters that are not listed above based on your business requirements, or retain their default values.
    After the parameters are configured, you can test the connectivity between the MaxCompute data source and the exclusive resource group. If the connectivity test is passed, Connectable appears in the Connectivity status column.
  5. Click Complete.
  6. Add an Elasticsearch data source in the same way.
    • Endpoint: The URL that is used to access the Elasticsearch cluster, in the format http://<Internal or public endpoint of the Elasticsearch cluster>:9200. You can obtain the endpoint from the Basic Information page of the cluster. For more information, see View the basic information of a cluster.
      Notice If you use the public endpoint of the cluster, add the elastic IP address (EIP) of the exclusive resource group to the public IP address whitelist of the cluster. For more information, see Configure a public or private IP address whitelist for an Elasticsearch cluster and Add the EIP or CIDR block of an exclusive resource group for Data Integration to the whitelist of a data source.
    • Username: The username that is used to access the Elasticsearch cluster. The default username is elastic.
    • Password: The password of the elastic username, which is specified when you create the cluster. If you forget the password, you can reset it. For more information about the procedure and precautions, see Reset the access password for an Elasticsearch cluster.
    Note Configure the parameters that are not listed above based on your business requirements.
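If you want explicit control over field types instead of relying on the Auto Indexing feature, you can create the destination index in Kibana before you run the synchronization node. The following request is a sketch that assumes the sample fields used in Step 4 (category, brand, and trans_num) and the single-type mapping syntax of Elasticsearch 7.x; adjust the field names and types to match your table schema:
    PUT /odps_index
    {
      "mappings": {
        "properties": {
          "category":  { "type": "keyword" },
          "brand":     { "type": "keyword" },
          "trans_num": { "type": "long" }
        }
      }
    }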

Step 3: Create and run a data synchronization node

  1. On the DataStudio page of the DataWorks console, create a workflow.
    For more information, see Manage workflows.
  2. Create a batch synchronization node.
    1. In the DataStudio pane, open the newly created workflow, right-click Data Integration, and then choose Create > Batch Synchronization.
    2. In the Create Node dialog box, configure the Node Name parameter and click Commit.
  3. In the Source section of the Connections step, specify the MaxCompute data source and the name of the table that you created. In the Target section, specify the Elasticsearch data source, index name, and index type. You can also switch to the code editor and configure the node as a JSON script; a sketch is provided after this procedure.
  4. In the Mappings step, configure mappings between source fields and destination fields.
  5. In the Channel step, configure the parameters.
  6. Configure properties for the node.
    In the right-side navigation pane of the configuration tab of the node, click Properties. On the Properties tab, configure properties for the node. For more information about the parameters, see Basic properties.
    Notice
    • Before you commit a node, you must configure a dependent ancestor node for the node in the Dependencies section of the Properties tab. For more information, see Instructions to configure scheduling dependencies.
    • If you want the system to periodically run a node, you must configure time properties for the node in the Schedule section of the Properties tab. The time properties include Validity Period, Scheduling Cycle, Run At, and Rerun.
    • The configuration of an auto triggered node takes effect at 00:00 of the next day.
  7. Configure the resource group that you want to use to run the synchronization node.
    1. In the right-side navigation pane of the configuration tab of the node, click the Resource Group configuration tab.
    2. Select the exclusive resource group that you created from the Exclusive Resource Groups drop-down list.
  8. Commit the node.
    1. Save the current configurations and click the Submit icon in the top toolbar.
    2. In the Commit Node dialog box, enter your comments in the Change description field.
    3. Click OK.
  9. Click the Run icon in the top toolbar to run the node.
    You can view the operational logs of the node when the node is running. After the node is successfully run, the result shown in the following figure is returned.
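If you use the code editor instead of the codeless UI, the synchronization node is expressed as a JSON script. The following is a minimal, abridged sketch and not an authoritative configuration: the exact parameter names (for example, whether the index type parameter is named type or indexType) vary across DataWorks versions, and the data source names, table name, and columns below are placeholders based on the sample data in this topic.
    {
      "type": "job",
      "version": "2.0",
      "steps": [
        {
          "stepType": "odps",
          "category": "reader",
          "name": "Reader",
          "parameter": {
            "datasource": "my_maxcompute_source",
            "table": "my_table",
            "column": ["category", "brand", "trans_num"]
          }
        },
        {
          "stepType": "elasticsearch",
          "category": "writer",
          "name": "Writer",
          "parameter": {
            "datasource": "my_es_source",
            "index": "odps_index",
            "type": "_doc"
          }
        }
      ]
    }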

Step 4: View the synchronized data

  1. Log on to the Kibana console of the destination Elasticsearch cluster.
    For more information, see Log on to the Kibana console.
  2. In the left-side navigation pane, click Dev Tools.
  3. On the Console tab of the page that appears, run the following command to query the synchronized data:
    POST /odps_index/_search?pretty
    {
      "query": { "match_all": {} }
    }
    Note Replace odps_index with the value that you specified for the index field when you configured the node in the code editor.
    If the data is synchronized, the result shown in the following figure is returned.
  4. Run the following command to query the category and brand fields in the data:
    POST /odps_index/_search?pretty
    {
      "query": { "match_all": {} },
      "_source": ["category", "brand"]
    }
  5. Run the following command to query data entries where the value of the category field is fresh:
    POST /odps_index/_search?pretty
    {
      "query": { "match": { "category": "fresh" } }
    }
  6. Run the following command to sort the data based on the trans_num field:
    POST /odps_index/_search?pretty
    {
      "query": { "match_all": {} },
      "sort": { "trans_num": { "order": "desc" } }
    }
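    To check that all data was synchronized, you can compare the document count in the index with the number of rows in the source table. The following request returns the total document count:
    GET /odps_index/_count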

    For more information, see the open source Elasticsearch documentation.