You can use Alibaba Cloud Elasticsearch (ES) to perform full-text searches, multi-dimensional queries, or statistical analysis on data from a MySQL database. This topic describes how to use the Data Integration service of DataWorks to synchronize data from a MySQL database to an Alibaba Cloud ES cluster in minutes.
Background information
DataWorks is an end-to-end big data development and governance platform that is built on big data engines. It integrates features such as data development, task scheduling, and data management. You can use sync tasks in DataWorks to quickly synchronize data from various data sources to an Alibaba Cloud ES cluster.
Supported data sources include the following:
ApsaraDB databases (MySQL, PostgreSQL, SQL Server, MongoDB, and HBase)
Alibaba Cloud PolarDB-X (upgraded from the previous DRDS)
Alibaba Cloud MaxCompute
Alibaba Cloud OSS
Alibaba Cloud Tablestore
Self-managed versions of Hadoop Distributed File System (HDFS), Oracle, FTP, DB2, and the database types listed above
Scenarios:
Offline synchronization of big data to an Alibaba Cloud ES cluster. You can synchronize an entire database or a specific table. For more information, see Offline synchronization of an entire MySQL database to Elasticsearch.
Real-time synchronization of big data to an Alibaba Cloud ES cluster. This method supports both full and incremental synchronization. For more information, see Real-time synchronization of an entire MySQL database to Elasticsearch.
Prerequisites
An ApsaraDB RDS for MySQL instance is created. For more information, see Create an ApsaraDB RDS for MySQL instance. This topic uses MySQL 5.7 as an example.
An Alibaba Cloud ES instance is created and the automatic index creation feature is enabled for the instance. For more information, see Create an Alibaba Cloud Elasticsearch instance and Configure the YML file.
A DataWorks workspace is created. For more information, see Create a workspace.
You can synchronize data only to Alibaba Cloud ES. Self-managed Elasticsearch clusters are not supported.
The ApsaraDB RDS for MySQL instance, ES instance, and DataWorks workspace must be in the same region.
The ApsaraDB RDS for MySQL instance, ES instance, and DataWorks workspace must be in the same time zone. Otherwise, a time zone difference may exist between the source and destination data when you synchronize time-related data.
Billing
For more information about ES instance fees, see ES billing items.
For more information about Data Integration resource group fees, see Resource group fees.
Procedure
This topic uses offline synchronization as an example. To synchronize data in real time, see Real-time synchronization of an entire MySQL database to Elasticsearch.
Step 1: Prepare the source data
Create a database and a table in the ApsaraDB RDS for MySQL instance.
You can use an ApsaraDB for RDS database or a self-managed database on a local server. This topic uses an ApsaraDB RDS for MySQL database as an example to demonstrate how to create a database and a table. For more information, see Quick Start.
You can use the following statements to create the table and insert data.
-- create table CREATE TABLE `es_test` ( `id` bigint(32) NOT NULL, `name` varchar(32) NULL, `age` bigint(32) NULL, `hobby` varchar(32) NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8; -- insert data INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (1,'user1',22,'music'); INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (2,'user2',23,'sport'); INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (3,'user3',43,'game'); INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (4,'user4',24,'run'); INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (5,'user5',42,'basketball');
Step 2: Purchase and configure a resource group
Purchase a Data Integration resource group and attach a virtual private cloud (VPC) and a workspace to it.
Log on to the DataWorks console.
In the top menu bar, select the region. In the navigation pane on the left, click Resource Groups.
Click Create Resource Group and configure the required parameters.
In the Operations column of the resource group that you created, click Network Settings to attach a VPC to the resource group. For more information, see Network settings.
The resource group must be able to connect to the VPCs of the ApsaraDB RDS for MySQL and ES instances to synchronize data. You must attach the VPC, Zone, and VSwitch to which the ApsaraDB RDS for MySQL and ES instances belong. To view the VPC information for an instance, see Switch the VPC and vSwitch for an ApsaraDB RDS for MySQL instance and View the basic information of an Elasticsearch instance.
ImportantAfter you attach a VPC, you must add the VSwitch CIDR Block of the VPC to the internal-facing whitelists of the ApsaraDB RDS for MySQL and ES instances. For more information, see Configure an IP address whitelist for an ApsaraDB RDS for MySQL instance and Configure a public or internal-facing access whitelist for an ES instance.
In the upper-left corner of the page, click the back icon to return to the Resource Groups page.
In the Operations column of the resource group that you created, click Attach Workspace to attach the target workspace to the resource group.
Step 3: Add data sources
Add the ApsaraDB RDS for MySQL and ES data sources to the Data Integration service of DataWorks.
Go to the Data Integration page in DataWorks.
Log on to the DataWorks console.
In the navigation pane on the left, click Workspaces.
In the Operations column of the target workspace, choose .
In the navigation pane on the left, click Data Source.
Add a MySQL data source.
On the Data Source List page, click Add Data Source.
On the Add Data Source page, search for and select MySQL.
In the Add MySQL Data Source dialog box, configure the data source parameters in the Basic Information section.
For more information about the configuration, see Configure a MySQL data source.
In the Connection Configuration section, click Test Connectivity. If the connectivity status is Connected, the connection is successful.
Click Complete.
Add an ES data source in the same way. For more information about the configuration, see Configure an ES data source.
Step 4: Configure and run an offline data synchronization task
An offline data synchronization task runs using a resource group. The resource group retrieves data from the data source in Data Integration and writes the data to the ES cluster.
You can configure an offline sync task in two ways. This topic uses the codeless UI as an example. You can also use the code editor to configure an offline sync task. For more information, see Configure a sync node in the code editor, MySQL Reader, and Elasticsearch Writer.
This topic uses the earlier version of Data Development (DataStudio) to demonstrate how to create an offline sync task.
Go to the Data Development page in DataWorks.
Log on to the DataWorks console.
In the navigation pane on the left, click Workspaces.
In the Operations column of the target workspace, choose .
Create an offline sync task.
On the Data Development (
icon) tab in the navigation pane on the left, choose . Follow the on-screen instructions to create a business flow.Right-click the business flow that you created and choose .
In the Create Node dialog box, enter a node name and click Confirm.
Configure the network and resources.
In the Data Source section, set Data Source to MySQL and Data Source Name to the name of the data source from which to synchronize data.
In the My Resource Group section, select a resource group.
In the Data Destination section, set Data Destination to Elasticsearch and Data Source Name to the name of the data source to which to synchronize data.
Click Next.
Configure the task.
In the Data Source section, select the table from which to synchronize data.
In the Data Destination section, configure the parameters for the data destination.
In the Field Mapping section, set the mapping relationship between the Source Field and the Destination Field.
In the Channel Control section, configure the channel parameters.
For more information about the configuration, see Configure a sync node in the codeless UI.
Run the task.
(Optional) Configure scheduling properties for the task. In the right-side pane, click Scheduling and configure the scheduling parameters as needed. For more information about the parameters, see Scheduling.
In the upper-left corner of the node area, click the save icon to save the task.
In the upper-left corner of the node area, click the commit icon to commit the task.
If you configured scheduling properties for the task, the task runs automatically on a regular basis. You can also click the run icon in the upper-left corner of the node area to run the task immediately.
If the log contains
Shell run successfully!, it indicates that the task ran successfully.
Step 5: Verify the data synchronization result
Log on to the Kibana console of the destination Alibaba Cloud ES instance. For more information, see Log on to the Kibana console.
In the upper-left corner of the Kibana page, click the icon and choose Dev Tools.
In the Console, run the following command to view the synchronized data.
POST /es_test/_search?pretty { "query": { "match_all": {}} }NoteReplace
es_testwith the index name that you specified in the data synchronization task.If the data is synchronized successfully, the following result is returned.
