After you configure the data sources, network environments, and resource groups, you can create and run a real-time synchronization solution to synchronize all data in a database. This topic describes how to create such a solution and how to view the running status of the nodes that it generates. The solution first synchronizes the full data of some or all tables in the database to Elasticsearch in offline mode, and then synchronizes incremental data to Elasticsearch in real time.
You can use the fully managed hot-warm architecture that Elasticsearch provides to store the real-time data of enterprises. Based on this architecture, DataWorks provides real-time synchronization solutions that synchronize all data in a database to Elasticsearch. Such a solution first synchronizes all data in the database in offline mode and then synchronizes incremental data in real time. You can also view the details of the solution, the running status of the nodes that it generates, and data updates in the database in real time, which facilitates subsequent data searches, analysis, and development.
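The two-phase pattern described above (an offline full synchronization followed by real-time incremental synchronization) can be illustrated with a minimal sketch. It is not how DataWorks is implemented: the in-memory `index` dictionary stands in for an Elasticsearch index, and the `full_sync` and `apply_change` functions and the change-event shape are assumptions made for illustration only.

```python
from typing import Any

# Hypothetical in-memory stand-ins: `index` plays the role of the
# Elasticsearch index, keyed by the primary key of the source table.
Row = dict[str, Any]


def full_sync(source_rows: list[Row], key: str) -> dict[Any, Row]:
    """Offline phase: copy a snapshot of every row into the index."""
    return {row[key]: dict(row) for row in source_rows}


def apply_change(index: dict[Any, Row], event: dict[str, Any], key: str) -> None:
    """Real-time phase: apply one change event on top of the snapshot."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        index[row[key]] = dict(row)   # upsert by primary key
    elif op == "delete":
        index.pop(row[key], None)     # deleting a missing row is a no-op
    else:
        raise ValueError(f"unknown operation: {op}")


snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
index = full_sync(snapshot, key="id")

changes = [
    {"op": "update", "row": {"id": 1, "name": "a2"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "name": "c"}},
]
for event in changes:
    apply_change(index, event, key="id")

print(index)
```

Keying both phases on the primary key is what lets the incremental changes be applied safely on top of the snapshot: replaying an insert or update for a key that already exists overwrites the document instead of duplicating it.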
- Synchronization rules can be configured in a flexible manner.
You can configure rules to process different data definition language (DDL) statements based on your business requirements. For example, if you select Ignore for the DDL message that drops a table, the system ignores the message when it receives it and does not drop the table in the destination.
- Large amounts of data can be updated in real time. This makes automated O&M and management more efficient.
You can use a real-time synchronization solution to synchronize all data in a database when you want the system to monitor data updates in business databases in real time. This way, upper-layer applications can perform searches, analysis, and development on real-time data.
- Only data in a MySQL database can be synchronized to Elasticsearch in real time.
- A real-time synchronization solution used to synchronize all data in a database can be run only on exclusive resource groups.
Create a real-time synchronization solution used to synchronize all data in a database
- Go to the Data Integration page, and then go to the Task list page. For more information, see Go to the Sync Solutions page.
- On the Task list page, click New task in the upper-right corner.
- In the New synchronization solution dialog box, click One-click realtime synchronization to Elasticsearch.
- In the Set synchronization sources and rules step, configure basic information such as the solution name for the data synchronization solution. In the Basic configuration section, configure the parameters.
  - Scheme name: The name of the data synchronization solution. The name can be a maximum of 50 characters in length.
  - Description: The description of the data synchronization solution. The description can be a maximum of 50 characters in length.
  - Destination task storage location: The Automatically establish workflow check box is selected by default. This indicates that DataWorks automatically creates a workflow named in the format of clone_database_Source data source name+to+Destination data source name in the Data Integration directory. All synchronization nodes generated by the data synchronization solution are placed in the directory of this workflow. If you clear the Automatically establish workflow check box, select a directory from the Select Location drop-down list. All synchronization nodes generated by the data synchronization solution are then placed in the specified directory.
- Select a source and configure synchronization rules.
- In the Select the source table for synchronization section, select the tables whose data you want to synchronize from the Source Table list. Then, click the icon to move the tables to the Selected Source table list. The Source Table list displays all tables in the selected source. You can choose to synchronize data in some or all tables in the source.
Notice: If a selected table has no primary key, you must customize a primary key when you create the mapping between the table and the destination Elasticsearch index. This primary key is used to remove duplicate data during synchronization. For example, you can use one field or a combination of several fields as the primary key of the table. For more information about how to configure mappings between the source tables and the destination Elasticsearch indexes, see Step 6.
- In the Set synchronization rules section, click Add rule and select an option to configure the naming rules for destination tables. Supported options include Conversion Rule for Table Name and Rule for Destination Table name.
- Conversion Rule for Table Name: the rule used to convert the names of source tables to those of destination tables.
- Rule for Destination Table name: the rule used to add a prefix and a suffix to the converted names of destination tables.
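The two naming rules can be thought of as a two-stage transformation: the table name is first converted, and then a prefix and a suffix are added to the converted name. The following is a minimal sketch of that idea, not the DataWorks implementation. The `destination_index_name` function, its parameters, and the use of a regular expression for the conversion are assumptions made for illustration. The final lowercasing reflects the Elasticsearch requirement that index names be lowercase.

```python
import re


def destination_index_name(
    source_table: str,
    pattern: str = r"^(.*)$",
    replacement: str = r"\1",
    prefix: str = "",
    suffix: str = "",
) -> str:
    """Derive a destination index name from a source table name.

    Stage 1 (conversion rule): rewrite the table name with a regular
    expression (an assumed mechanism, chosen for illustration).
    Stage 2 (destination rule): wrap the converted name with a prefix
    and a suffix.
    """
    converted = re.sub(pattern, replacement, source_table, count=1)
    # Elasticsearch index names must be lowercase.
    return f"{prefix}{converted}{suffix}".lower()


# Strip a numeric shard suffix, then add a prefix and a suffix.
name = destination_index_name(
    "ORDERS_2023", pattern=r"_\d+$", replacement="", prefix="es_", suffix="_idx"
)
print(name)
```

With the default arguments the function leaves the table name unchanged apart from lowercasing, which mirrors a solution in which no naming rule is added.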
- Select the destination and configure the destination index.
- In the Set Destination Index step, specify Target Elasticsearch data source.
- Click Refresh source table and Elasticsearch Index mapping to configure the mappings between the source tables and destination Elasticsearch indexes.
- View the mapping progress, source tables, and mapped destination indexes.
  - 1: The progress of mapping between the source tables and destination indexes.
    Note: The mapping may require a long time if you want to synchronize data from a large number of tables.
  - 2:
  - 3: The method used to create an index. Valid values:
- Click Next Step.
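When a source table has no primary key, the custom primary key configured in the mapping is what allows duplicate rows to be removed. One common way to do this, shown here as a sketch under stated assumptions rather than as the DataWorks behavior, is to derive a stable document ID from the chosen fields so that rows with the same key overwrite each other. The `document_id` and `deduplicate` helpers and the SHA-1 hashing are hypothetical choices.

```python
import hashlib
import json
from typing import Any


def document_id(row: dict[str, Any], key_fields: list[str]) -> str:
    """Build a stable document ID from one or more key fields."""
    missing = [f for f in key_fields if f not in row]
    if missing:
        raise KeyError(f"row is missing key fields: {missing}")
    # Serialize the key values in the declared order so that the same
    # key always yields the same ID.
    payload = json.dumps([row[f] for f in key_fields], default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def deduplicate(
    rows: list[dict[str, Any]], key_fields: list[str]
) -> dict[str, dict[str, Any]]:
    """Keep the last row seen for each key, as an upsert by ID would."""
    docs: dict[str, dict[str, Any]] = {}
    for row in rows:
        docs[document_id(row, key_fields)] = row
    return docs


rows = [
    {"user_id": 7, "day": "2024-01-01", "clicks": 3},
    {"user_id": 7, "day": "2024-01-01", "clicks": 5},  # same key, newer value
    {"user_id": 7, "day": "2024-01-02", "clicks": 1},
]
docs = deduplicate(rows, key_fields=["user_id", "day"])
print(len(docs))
```

Here the combination of `user_id` and `day` acts as the primary key, so the first two rows collapse into one document that holds the later value.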
- Configure rules to process DDL messages. DDL messages exist in the source. Before you synchronize data, you can configure synchronization rules for different DDL messages based on your business requirements.
Note: The rules apply the first time a real-time synchronization solution is run. If you want to modify the rules in subsequent operations, go to the configuration page of the real-time synchronization solution to perform the operation. For more information, see Manage the real-time synchronization solution.
- In the Processing Policy for DDL Messages step, configure the policies to process DDL messages for the real-time synchronization nodes generated by the synchronization solution. The policies for the different DDL messages are described below.
The following types of DDL messages are supported: CreateTable, DropTable, AddColumn, DropColumn, RenameTable, RenameColumn, ChangeColumn, and TruncateTable. After DataWorks receives a DDL message of one of these types, it processes the message based on the policy that you select for that type:
- Normal: sends the message to the destination. Then, the destination processes the message. If you select Normal for a message type such as CreateTable, DataWorks only forwards the messages, and each destination may process them based on its own business logic. For example, an AddColumn DDL message is a wrong instruction to MaxCompute, but it is a normal instruction to Hologres.
- Ignore: ignores the message and does not send it to the destination.
- Alert: ignores the message and records an alert in the real-time synchronization logs. The alert also records the reason why the message was ignored.
- Error: returns an error while the real-time synchronization solution is running and terminates the solution.
- Click Next Step.
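The four processing policies amount to a per-message-type dispatch. The following minimal sketch illustrates that logic only. The `Policy` enum, the `handle_ddl` function, the `DdlPolicyError` exception, and the `forwarded` list (a stand-in for the destination) are all hypothetical names, not part of any DataWorks API.

```python
import logging
from enum import Enum

logger = logging.getLogger("ddl_policy_sketch")


class Policy(Enum):
    NORMAL = "Normal"
    IGNORE = "Ignore"
    ALERT = "Alert"
    ERROR = "Error"


class DdlPolicyError(RuntimeError):
    """Raised when a DDL message hits the Error policy."""


def handle_ddl(message_type: str, policies: dict[str, Policy], forwarded: list[str]) -> None:
    """Apply the configured policy to one DDL message.

    `forwarded` is a hypothetical stand-in for the destination.
    An unconfigured message type raises KeyError in this sketch.
    """
    policy = policies[message_type]
    if policy is Policy.NORMAL:
        forwarded.append(message_type)  # pass the message on unchanged
    elif policy is Policy.IGNORE:
        return                          # drop the message silently
    elif policy is Policy.ALERT:
        logger.warning("ignored DDL message: %s", message_type)
    elif policy is Policy.ERROR:
        raise DdlPolicyError(f"DDL message not allowed: {message_type}")


policies = {
    "CreateTable": Policy.NORMAL,
    "DropTable": Policy.IGNORE,
    "AddColumn": Policy.ALERT,
    "TruncateTable": Policy.ERROR,
}
forwarded: list[str] = []
for message in ("CreateTable", "DropTable", "AddColumn"):
    handle_ddl(message, policies, forwarded)
print(forwarded)
```

In this sketch only the Error policy interrupts processing, by raising an exception, which corresponds to the solution being terminated. Ignore and Alert differ only in whether a log record is written.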
- Configure the resources required for the data synchronization solution. In the Run resource settings step, configure the parameters.
- Offline Full synchronization
  - Offline task name rules: The name of the batch synchronization node that is used to synchronize the full data of the source. After a data synchronization solution is created, DataWorks first generates a batch synchronization node to synchronize full data, and then generates real-time synchronization nodes to synchronize incremental data.
  - Resource Group for Full Batch Sync Nodes: The exclusive resource group for data integration that is used to run the batch synchronization node.
- Full Batch Scheduling
  - Select scheduling Resource Group: The resource group for scheduling that is used to run the nodes. Only exclusive resource groups for data integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for data integration that you purchased.
    Note: If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
- Real-time incremental synchronization
  - Select an exclusive resource group for real-time tasks: The exclusive resource group that is used to run the real-time synchronization nodes.
- Channel Settings
  - Maximum number of connections supported by source read: The maximum number of Java Database Connectivity (JDBC) connections that are allowed by the source database. Specify an appropriate number based on the resources of the source database. The default value is 20.
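The effect of a cap on source connections can be illustrated with a small concurrency sketch: however many tables are read in parallel, no more than the configured number of readers hold a connection at any moment. This is an illustration of the concept only, not the DataWorks scheduler. The `ConnectionLimiter` class and the `read_table` function are hypothetical, and no real database connection is opened.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_SOURCE_CONNECTIONS = 20  # the default value of the parameter


class ConnectionLimiter:
    """Caps how many readers hold a source connection at once."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous holders observed

    def __enter__(self) -> "ConnectionLimiter":
        self._slots.acquire()  # blocks while all slots are taken
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, *exc) -> None:
        with self._lock:
            self._active -= 1
        self._slots.release()


def read_table(limiter: ConnectionLimiter, table: str) -> str:
    with limiter:
        # A real reader would open a connection and scan `table` here.
        return table


limiter = ConnectionLimiter(MAX_SOURCE_CONNECTIONS)
tables = [f"table_{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(lambda t: read_table(limiter, t), tables))

print(limiter.peak <= MAX_SOURCE_CONNECTIONS)
```

A lower limit protects the source database at the cost of a slower full synchronization, which is why the value should match the resources of the source.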
- Click Complete configuration. The real-time synchronization solution used to synchronize all data in a database is created.
Run the real-time synchronization solution
On the Task list page, find the created data synchronization solution and click Submit execution in the Operation column to run the data synchronization solution.
View the running status and result of the synchronization nodes
- On the Task list page, find the solution that is run and click Execution details in the Operation column. Then, you can view the running details of all nodes generated by the synchronization solution.
- Find a node whose running details you want to view and click Execution details in the Status column. In the dialog box that appears, click the provided link to go to the DataStudio page.