Your business data is stored in a Distributed Relational Database Service (DRDS) database. If you want to perform full-text searches and semantic analytics on the data, you can synchronize the data to an Alibaba Cloud Elasticsearch cluster in offline mode.

Background information

DRDS is a distributed relational database service developed by Alibaba Cloud. It integrates a distributed SQL engine with the proprietary distributed storage X-DB. Based on an integrated cloud-native architecture, DRDS supports tens of millions of concurrent connections and can store hundreds of petabytes of data. DRDS is designed to address massive data storage, ultra-high concurrent throughput, performance bottlenecks of large tables, and the efficiency of complex computing. The service has matured through its use during Double 11 and by Alibaba Cloud customers across industries, and it helps enterprises accelerate their digital transformation. For more information, see Overview.

Alibaba Cloud Elasticsearch is compatible with open source Elasticsearch features such as Security, Machine Learning, Graph, and Application Performance Monitoring (APM). It is available in versions 5.5.3, 6.3.2, 6.7.0, 6.8.0, 7.4.0, and 7.7.1, supports the commercial X-Pack plug-in, and is ideal for scenarios such as data analytics and search. Based on open source Elasticsearch, Alibaba Cloud Elasticsearch provides enterprise-grade access control, security monitoring and alerting, and automated reporting. For more information, see What is Alibaba Cloud Elasticsearch? You can also apply for a free trial.

Procedure

  1. Preparations
    Create a DRDS instance and an Alibaba Cloud Elasticsearch cluster in the same virtual private cloud (VPC). Prepare the data that you want to migrate in the DRDS instance. Activate the Data Integration and DataStudio services of DataWorks.
    Note To improve the stability of synchronization nodes, we recommend that you synchronize data within a VPC.
  2. Step 1: Purchase and create an exclusive resource group
    In the DataWorks console, purchase and create an exclusive resource group. To ensure the network connection, you must bind the exclusive resource group to the VPC where the DRDS instance resides.
    Note Exclusive resource groups can be used to transmit data in a fast and stable manner.
  3. Step 2: Add data sources
    In the DataWorks console, add the DRDS instance and the Elasticsearch cluster as data sources.
  4. Step 3: Create and run a data synchronization node
    Use the codeless user interface (UI) to create and configure a node that synchronizes data from the DRDS data source to the Elasticsearch cluster. When you configure the node, select the exclusive resource group that you created. The node runs on the selected exclusive resource group for Data Integration and writes the data to the Elasticsearch cluster.
  5. Step 4: View synchronization results
    In the Kibana console of the Elasticsearch cluster, view the synchronized data and search for data by using a specific field.

Preparations

  1. Create a DRDS V1.0 instance, a DRDS database, and a table. Then, insert data into the table.
    For more information, see Basic SQL operations. The following figure shows the test data that is used in this topic.
    Notice After a database is created, all IP addresses are allowed to access the database by default. For security purposes, we recommend that you add only the IP address of the host that you use to the whitelist of the DRDS instance. For more information, see Set an IP address whitelist.
  2. Create a DataWorks workspace.
    For more information, see Create a workspace. The workspace must reside in the same region as the DRDS instance that you created.
  3. Create an Elasticsearch cluster and enable the Auto Indexing feature for the cluster.
    For more information, see Create an Alibaba Cloud Elasticsearch cluster and Access and configure an Elasticsearch cluster. The cluster must belong to the same VPC and vSwitch as the DRDS instance.
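
In step 3, you enable the Auto Indexing feature so that the synchronization node can create the destination index when it writes data. If you want to confirm that automatic index creation is allowed on the Elasticsearch side, the following Kibana console request is a minimal sketch. It assumes that the cluster does not restrict the action.auto_create_index setting, which is the cluster setting that the Auto Indexing feature corresponds to.

    # Query the effective value of action.auto_create_index. A value of "true" (the default)
    # allows the synchronization node to create the destination index automatically.
    GET _cluster/settings?include_defaults=true&filter_path=*.action.auto_create_index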

Step 1: Purchase and create an exclusive resource group

  1. Log on to the DataWorks console.
  2. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Groups.
  3. Purchase exclusive resources for Data Integration. For more information, see Purchase exclusive resources for Data Integration.
    Notice The exclusive resources for Data Integration must reside in the same region as the DataWorks workspace that you created.
  4. Create an exclusive resource group for Data Integration. For more information, see Create an exclusive resource group for Data Integration.
    The following figure shows the configuration used in this example. Resource Group Type is set to Exclusive Resource Groups for Data Integration.
  5. Find the created exclusive resource group and click Network Settings in the Actions column. The VPC Binding tab appears. On the VPC Binding tab, click Add Binding to bind the exclusive resource group to a VPC. For more information, see Configure network settings.
    Exclusive resources are deployed in the VPC where DataWorks resides. DataWorks can synchronize data from the DRDS database to the Elasticsearch cluster only after it connects to the VPCs where the database and the cluster reside. In this topic, the DRDS database and the Elasticsearch cluster reside in the same VPC. Therefore, when you bind the exclusive resource group to a VPC, select the VPC and vSwitch to which the DRDS instance belongs.
  6. Click Change Workspace in the Actions column that corresponds to the exclusive resource group to bind it to the DataWorks workspace that you created. For more information, see Associate an exclusive resource group with a workspace.

Step 2: Add data sources

  1. Go to the Data Integration page.
    1. In the left-side navigation pane of the DataWorks console, click Workspaces.
    2. Find the workspace you created and click Data Integration in the Actions column.
  2. In the left-side navigation pane of the Data Integration page, choose Data Source > Data Sources.
  3. On the Data Source page, click Add data source in the upper-right corner.
  4. In the Relational Database section of the Add data source dialog box, click DRDS.
  5. In the Add DRDS data source dialog box, configure the parameters and test the connectivity between the DRDS data source and the resource group that you created. After the connectivity test is passed, click Complete.
    The following parameters are used in this example:
    • Data source type: In this topic, this parameter is set to Connection string mode. You can also set this parameter to Alibaba Cloud Database (DRDS). For more information, see Add a DRDS data source.
    • Data Source Name: The name of the data source. The name can contain only letters, digits, and underscores (_) and must start with a letter.
    • Data source description: The description of the data source. The description can be a maximum of 80 characters in length.
    • JDBC URL: The Java Database Connectivity (JDBC) URL of the database, in the format jdbc:mysql://ServerIP:Port/Database. Replace ServerIP and Port with the VPC endpoint and port number of the DRDS instance, and replace Database with the name of the DRDS database that you created.
    • User name: The username that is used to connect to the DRDS database.
    • Password: The password that is used to connect to the DRDS database.
  6. Add an Elasticsearch data source in the same way.
    The following parameters are used in this example:
    • Data Source Name: The name of the data source. The name can contain only letters, digits, and underscores (_) and must start with a letter.
    • Data source description: The description of the data source. The description can be a maximum of 80 characters in length.
    • Endpoint: Set this parameter to a value in the format of http://<Internal endpoint of the Elasticsearch cluster>:9200. You can obtain the internal endpoint from the Basic Information page of the cluster.
    • Username: The username that is used to access the Elasticsearch cluster. The default username is elastic.
    • Password: The password that corresponds to the elastic username. The password is specified when you create the cluster. If you forget the password, you can reset it. For more information about the procedure and precautions for resetting a password, see Reset the access password for an Elasticsearch cluster.

Step 3: Create and run a data synchronization node

  1. On the DataStudio page of the DataWorks console, create a workflow.
    For more information, see Manage workflows.
  2. Create a batch synchronization node.
    1. In the DataStudio pane, open the newly created workflow, right-click Data Integration, and then choose Create > Batch Synchronization.
    2. In the Create Node dialog box, configure the Node Name parameter and click Commit.
  3. In the Source section of the Connections step, specify the DRDS data source and the name of the table that you created. In the Target section, specify the Elasticsearch data source, index name, and index type.
  4. In the Mappings step, configure mappings between source fields and destination fields.
    In this example, the default source fields are used, and you need to change only the destination fields. Click the Change Fields icon in the destination field section on the right. In the Change Fields dialog box, enter the destination fields that you want to map the source fields to. If you prefer to create the destination index with an explicit mapping instead of relying on the Auto Indexing feature, see the sketch after this procedure.
    The following figure shows the configured field mappings.
  5. In the Channel step, configure the parameters.
  6. Configure properties for the node.
    In the right-side navigation pane of the configuration tab of the node, click Properties. On the Properties tab, configure properties for the node. For more information about the parameters, see Basic properties.
    Notice
    • Before you commit a node, you must configure a dependent ancestor node for the node in the Dependencies section of the Properties tab. For more information, see Instructions to configure scheduling dependencies.
    • If you want the system to periodically run a node, you must configure time properties for the node in the Schedule section of the Properties tab. The time properties include Validity Period, Scheduling Cycle, Run At, and Rerun.
    • The configuration of an auto triggered node takes effect at 00:00 of the next day.
  7. Configure the resource group that you want to use to run the synchronization node.
    1. In the right-side navigation pane of the configuration tab of the node, click the Resource Group configuration tab.
    2. Select the exclusive resource group that you created from the Exclusive Resource Groups drop-down list.
  8. Commit the node.
    1. Save the current configurations, and then click the Submit icon in the top toolbar.
    2. In the Commit Node dialog box, enter your comments in the Change description field.
    3. Click OK.
  9. Click the Run icon in the top toolbar to run the node.
    You can view the operational logs of the node while it is running. A successfully message in the logs indicates that the node runs as expected, and FINISH indicates that the node run is complete.
    Note Before you run the node, you can configure properties for the node and select the desired resource group to run the node. For more information, see Basic properties.
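
If you prefer not to rely on the Auto Indexing feature, you can create the destination index with an explicit mapping in the Kibana console before you run the node. The following request is only a sketch that assumes an Elasticsearch 7.x cluster and the index name drdstest used in this example. The field names id, Name, and Publisher are placeholders that you must replace with the destination fields that you configured in the Mappings step. Publisher is mapped as a text field with a keyword sub-field so that the term query on Publisher.keyword in the next step still works.

    # Create the destination index with an explicit mapping (sketch, Elasticsearch 7.x syntax).
    # For a 6.x cluster, nest the properties under the index type that you specified for the node.
    PUT drdstest
    {
      "mappings": {
        "properties": {
          "id":        { "type": "long" },
          "Name":      { "type": "text" },
          "Publisher": {
            "type": "text",
            "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
          }
        }
      }
    }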

Step 4: View synchronization results

  1. Log on to the Kibana console of the destination Elasticsearch cluster.
    For more information, see Log on to the Kibana console.
  2. In the left-side navigation pane, click Dev Tools.
  3. On the Console tab of the page that appears, run the following command to query the volume of data in the Elasticsearch cluster.
    Note You can compare the queried data volume with the volume of data in the DRDS database to check whether all data is synchronized.
    GET drdstest/_search
    {
      "query": {
        "match_all": null
      }
    }
    If the command is successfully run, the result shown in the following figure is returned.
  4. Run the following command to search for data by using a specific field:
    GET drdstest/_search
    {
      "query": {
        "term": {
          "Publisher.keyword": {
            "value": "Nintendo"
          }
        }
      }
    }
    If the command is successfully run, the result shown in the following figure is returned.
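
Two additional verification requests may be useful. The first returns only the number of documents in the index, which makes the comparison with the row count of the DRDS table more direct; the second runs a full-text match query instead of an exact term query. Both are sketches that assume the index name drdstest and the Publisher field from this example.

    # Return only the document count of the index.
    GET drdstest/_count

    # Full-text search on the Publisher field.
    GET drdstest/_search
    {
      "query": {
        "match": {
          "Publisher": "Nintendo"
        }
      }
    }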