Alibaba Cloud provides a variety of cloud storage and database services. If you want to search for and analyze data stored in these services, you can use the Data Integration service provided by DataWorks to collect the offline data at a minimum interval of 5 minutes and synchronize the collected data to Alibaba Cloud Elasticsearch. In this topic, data is synchronized from an ApsaraDB RDS for MySQL database to an Alibaba Cloud Elasticsearch cluster.
Procedure
- PreparationsPrepare a MySQL data source and create a DataWorks workspace and an Alibaba Cloud Elasticsearch cluster. Configure the Elasticsearch cluster.
- Step 1: Purchase and create an exclusive resource groupCreate an exclusive resource group for Data Integration, and associate the resource group to a virtual private cloud (VPC) and the created workspace. Exclusive resource groups ensure fast and stable data transmission.
- Step 2: Add data sourcesConnect the MySQL data source and Elasticsearch cluster to the Data Integration service provided by DataWorks.
- Step 3: Create and run a data synchronization nodeUse the codeless user interface (UI) to create a node to synchronize data from the MySQL data source to the Elasticsearch cluster and configure the node. Select the exclusive resource group that you created when you configure the node. The data synchronization node runs on the selected exclusive resource group for Data Integration and writes data to the Elasticsearch cluster.
- Step 4: View the synchronized dataView the synchronized data in the Kibana console of the Elasticsearch cluster.
Preparations
- Create a database. You can use an ApsaraDB RDS database or create a database on your on-premises machine. In this topic, an ApsaraDB RDS for MySQL database is used. Join two MySQL tables and synchronize data in the tables to Alibaba Cloud Elasticsearch. The following figures show the two tables. For more information, see Create an ApsaraDB RDS for MySQL instance.
Figure 1. Table 1 Figure 2. Table 2 - Create a DataWorks workspace. For more information, see Create a workspace. The workspace must reside in the same region as the ApsaraDB RDS for MySQL database.
- Create an Alibaba Cloud Elasticsearch cluster and enable the Auto Indexing feature for the cluster. The Elasticsearch cluster must reside in the same virtual private cloud (VPC) as the ApsaraDB RDS for MySQL database. For more information, see Create an Alibaba Cloud Elasticsearch cluster and Configure the YML file.
Step 1: Purchase and create an exclusive resource group
- Log on to the DataWorks console.
- In the top navigation bar, select the region where your DataWorks workspace resides. In the left-side navigation pane, click Resource Groups.
- Purchase exclusive resources for Data Integration. For more information, see Create and use an exclusive resource group for Data Integration.Important The exclusive resources for Data Integration must reside in the same region as your DataWorks workspace.
- Create an exclusive resource group for Data Integration. For more information, see Create and use an exclusive resource group for Data Integration. The following figure shows the configuration used in this example. Resource Group Type is set to Exclusive Resource Group for Data Integration.
- Find the exclusive resource group that you create and click Network Settings in the Actions column. The VPC Binding tab appears. On the VPC Binding tab, click Add Binding to bind the exclusive resource group to a VPC. For more information, see Associate the exclusive resource group with a VPC. Exclusive resources are deployed in the VPC where DataWorks resides. You can use DataWorks to synchronize data from the ApsaraDB RDS for MySQL database to the Elasticsearch cluster only after DataWorks connects to the VPCs where the database and cluster reside. In this topic, the ApsaraDB RDS for MySQL database and Elasticsearch cluster reside in the same VPC. Therefore, when you bind the exclusive resource group to a VPC, you need to select the VPC and vSwitch to which the Elasticsearch cluster belongs.
- On the Resource Groups page, find the exclusive resource group and click Change Workspace in the Actions column to associate it with the DataWorks workspace that you created. For more information, see Create and use an exclusive resource group for Data Integration.
Step 2: Add data sources
- Go to the Data Integration page.
- In the left-side navigation pane of the DataWorks console, click Workspaces.
- Find the workspace you created and click Data Integration in the Actions column.
- In the left-side navigation pane of the Data Integration page, click Data Source.
- On the Data Source page, click New data source in the upper-right corner.
- In the Relational Database section of the Add data source dialog box, click MySQL. In the Add MySQL data source dialog box, configure the parameters. Data source type: In this example, this parameter is set to Alibaba Cloud instance mode. You can also set this parameter to Connection string mode. For more information about the configurations of other parameters, see Add a MySQL data source.Important If you set the Data source type parameter to Connection string mode, you can set the JDBC URL parameter to the public endpoint of the ApsaraDB RDS for MySQL instance. You must add the elastic IP address (EIP) of the exclusive resource group to the whitelist of the ApsaraDB RDS for MySQL instance. For more information, see Configure an IP address whitelist for an ApsaraDB RDS for MySQL instance and Add the EIP or CIDR block of an exclusive resource group for Data Integration to the whitelist of a data source.After the parameters are configured, you can test the connectivity between the MaxCompute data source and the exclusive resource group. If the connectivity test is passed, Connectable appears in the Connectivity status column.
- Click Complete.
- Add an Elasticsearch data source in the same way. In this example, Data Source Type is set to Connection String Mode. The following table describes only some parameters that need to be configured.
Parameter Description Endpoint The URL that is used to access the Elasticsearch cluster. Specify the URL in the following format: http://<Internal or public endpoint of the Elasticsearch cluster>:9200
. You can obtain the endpoint from the Basic Information page of the cluster. For more information, see View the basic information of a cluster.Important If you use the public endpoint of the cluster, add the elastic IP address (EIP) of the exclusive resource group to a public IP address whitelist of the cluster. For more information, see Configure a public or private IP address whitelist for an Elasticsearch cluster and Add the EIP or CIDR block of an exclusive resource group for Data Integration to the whitelist of a data source.Username The username that is used to access the Elasticsearch cluster. The default username is elastic. Password The password that is used to access the Elasticsearch cluster. The password is specified when you create the cluster. If you forget the password, you can reset it. For more information about the procedure and precautions for resetting the password, see Reset the access password for an Elasticsearch cluster. Note Configure the parameters that are not listed in the preceding table based on your business requirements.
Step 3: Create and run a data synchronization node
- On the DataStudio page of the DataWorks console, create a workflow. For more information, see Create a workflow.
- Create a batch synchronization node.
- In the DataStudio pane, open the newly created workflow, right-click Data Integration, and then choose .
- In the Create Node dialog box, configure the Node Name parameter and click Commit.
- In the Source section of the Connections step, specify the MySQL data source and the name of the table that you created. In the Target section, specify the Elasticsearch data source, index name, and index type. Note
- You can also use the code editor to configure the node. For more information, see Configure a batch synchronization node by using the code editor, DRDS Reader, and Elasticsearch Writer.
- We recommend that you set Enable node discovery to No in the advanced settings of the Elasticsearch data source. Otherwise, a connection timeout error occurs during data synchronization.
- In the Mappings step, configure mappings between source fields and destination fields.
- In the Channel step, configure the parameters.
- Configure properties for the node. In the right-side navigation pane of the configuration tab of the node, click Properties. On the Properties tab, configure properties for the node. For more information about the parameters, see Basic properties.Important
- Before you commit a node, you must configure a dependent ancestor node for the node in the Dependencies section of the Properties tab. For more information, see Configure same-cycle scheduling dependencies.
- If you want the system to periodically run a node, you must configure time properties for the node in the Schedule section of the Properties tab. The time properties include Validity Period, Scheduling Cycle, Run At, and Rerun.
- The configuration of an auto triggered node takes effect at 00:00 of the next day.
- Configure the resource group that you want to use to run the synchronization node.
- In the right-side navigation pane of the configuration tab of the node, click the Resource Group configuration tab.
- Select the exclusive resource group that you create from the Exclusive Resource Groups drop-down list.
- Commit the node.
- Save the current configurations and click the
icon in the top toolbar.
- In the Commit Node dialog box, enter your comments in the Change description field.
- Click OK.
- Save the current configurations and click the
- Click the
icon in the top toolbar to run the node.
You can view the operational logs of the node when the node is running. After the node is successfully run, the result shown in the following figure is returned.
Step 4: View the synchronized data
- Log on to the Kibana console of the destination Elasticsearch cluster. For more information, see Log on to the Kibana console.
- In the left-side navigation pane, click Dev Tools.
- On the Console tab of the page that appears, run the following command to query the synchronized data:
POST /mysqljoin/_search?pretty { "query": { "match_all": {}} }
Notemysqljoin
is the value that you configured for theindex
field when you configure the node by using the code editor.If the data is successfully synchronized, the result shown in the following figure is returned.