After you configure data sources, network environments, and resource groups, you can
create and run a batch synchronization solution to synchronize all data in a database.
This topic describes how to create a batch synchronization solution to synchronize
data in some or all tables in a database to Elasticsearch. This topic also describes
how to view the statuses of the nodes generated by the batch synchronization solution.
Prerequisites
Before you create a data synchronization solution, make sure that the data sources, network environments, and resource groups that you want to use are configured.
Background information
In most cases, the real-time data of an enterprise is stored in big data engines, and a large volume of unstructured log data may be generated for the real-time data.
You can use the hot-warm architecture that is provided by fully managed Elasticsearch to store the log data and offline data of your enterprise. DataWorks provides
batch synchronization solutions that can synchronize all data in a database to Elasticsearch based on this architecture. You can view the details of a solution
and the statuses of the nodes that the solution generates. This makes automated operations and maintenance (O&M) and management more efficient.
You can use a batch synchronization solution to synchronize the full or incremental
data in your business database to Elasticsearch. Then, you can search and analyze
the data in Elasticsearch. A batch synchronization solution used to synchronize
all data in a database has the following benefits:
- Synchronizes the full data of a database.
You do not need to create multiple batch data synchronization nodes to synchronize
source tables one by one. You can directly create a batch synchronization solution
to synchronize some or all of the tables in a database at a time.
- Supports various data synchronization methods.
You can use one of the following methods to synchronize data: full data synchronization,
incremental data synchronization, and a combination of full and incremental data synchronization.
In addition, you can configure properties for your batch synchronization solution.
- Requires only simple configurations.
You do not need to perform complex operations, such as creating synchronization nodes,
databases, and tables, configuring dependencies for nodes, and configuring mappings
between sources and destinations. Instead, you need only to configure a batch synchronization
solution in a configuration wizard.
- Reduces costs and improves O&M efficiency.
Limits
- You can use a batch synchronization solution to synchronize all data in a database to Elasticsearch only if the source database is a MySQL, SQL Server, or PolarDB database.
- A batch synchronization solution used to synchronize all data in a database can be
run only on resources in exclusive resource groups for Data Integration.
Create a batch synchronization solution to synchronize all data in a database
- Go to the Data Integration page and then open the Task list page.
- On the Task list page, click New task in the upper-right corner.
- In the New synchronization solution dialog box, click One-click batch synchronization to Elasticsearch.
- In the Set synchronization sources and rules step, configure basic information such
as the solution name for the data synchronization solution.
In the Basic configuration section, configure the following parameters:
- Scheme name: The name of the data synchronization solution. The name can be a maximum of 50 characters in length.
- Description: The description of the data synchronization solution. The description can be a maximum of 50 characters in length.
- Destination task storage location: The Automatically establish workflow check box is selected by default. This indicates that DataWorks automatically creates a workflow named in the format of clone_database_Source data source name+to+Destination data source name in the Data Integration directory. All synchronization nodes generated by the data synchronization solution are placed in the directory of this workflow.
  If you clear the Automatically establish workflow check box, select a directory from the Select Location drop-down list. All synchronization nodes generated by the data synchronization solution are placed in the specified directory.
- Select a data source as the source and configure synchronization rules.
- In the Data source section, specify the Type and Data source parameters.
Note You can select MySQL, SQL Server, or PolarDB as the source.
- In the Source Table section, select the tables whose data you want to synchronize from the Source Table list. Then, click the
icon to move the tables to the Selected Source Table list. 
The Source Table list displays all tables in the selected source. You can choose to
synchronize data in some or all tables in the source.
Notice If a selected table has no primary key, you must customize a primary key when you
map the table to a destination Elasticsearch index. This primary key is used to remove
duplicate data during synchronization. For example, you can use one field or a combination
of several fields as the primary key of the table. For more information, see
Step 6 in this topic.
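To make the deduplication behavior concrete, the following sketch shows how a custom primary key built from a combination of fields can serve as the Elasticsearch document ID, so that re-synchronized rows overwrite existing documents instead of creating duplicates. This is only an illustration of the principle, not the code that DataWorks runs; the index name, field names, and the use of the Elasticsearch 8.x Python client are assumptions.

```python
# Illustration only: how a composite primary key can deduplicate writes.
# The index name, fields, and client version are hypothetical assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # replace with your cluster endpoint

def composite_id(row, key_fields):
    """Join the chosen primary key fields into a deterministic document ID."""
    return "_".join(str(row[field]) for field in key_fields)

row = {"order_id": 1001, "region": "cn-hangzhou", "amount": 99.5}
key_fields = ["order_id", "region"]  # fields chosen as the custom primary key

# Writing with an explicit ID makes repeated synchronizations overwrite the
# same document, so duplicate rows do not accumulate in the index.
es.index(index="orders_index", id=composite_id(row, key_fields), document=row)
```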
- In the Conversion Rule for Table Name section, click Add rule to select a rule. Supported options include Conversion Rule for Table Name and Rule for Destination Index Name.
  - Conversion Rule for Table Name: the rule for converting the names of source tables to the names of destination Elasticsearch indexes.
  - Rule for Destination Index Name: the rule for adding a prefix and a suffix to the converted names of destination Elasticsearch indexes.
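To make the effect of these rules concrete, the following sketch mimics a table name conversion rule and a destination index name rule in plain Python. The regular expression, prefix, and suffix are hypothetical examples; the actual rules are configured in the DataWorks console, not in code.

```python
# Illustration only: what a table-name conversion rule and an index-name rule do.
# The pattern, prefix, and suffix below are hypothetical examples.
import re

def convert_table_name(table_name: str) -> str:
    # Hypothetical conversion rule: strip a trailing shard suffix such as "_0001".
    return re.sub(r"_\d+$", "", table_name)

def apply_index_name_rule(name: str, prefix: str = "ods_", suffix: str = "_es") -> str:
    # Hypothetical destination index name rule: add a prefix and a suffix.
    return f"{prefix}{name}{suffix}"

for table in ["orders_0001", "orders_0002", "users"]:
    print(table, "->", apply_index_name_rule(convert_table_name(table)))
# orders_0001 -> ods_orders_es
# orders_0002 -> ods_orders_es
# users -> ods_users_es
```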
- Click Next Step.
- Select a destination cluster and configure destination Elasticsearch indexes.
- In the Set Destination Index step, specify Destination.
- Click Refresh source table and Elasticsearch Index mapping to configure the mappings between the source tables and destination Elasticsearch
indexes.
- View the mapping progress, source tables, and mapped destination Elasticsearch indexes.

1. The progress of mapping the source tables to destination Elasticsearch indexes.
   Note The mapping may require a long time if you want to synchronize data from a large number of tables.
2. Primary key configuration:
   - If the tables in the source database contain primary keys, the system removes duplicate data based on the primary keys during the synchronization.
   - If the tables in the source database do not contain primary keys, you can click the icon to customize primary keys. You can use one field or a combination of several fields as the primary keys of the tables. This way, the system removes duplicate data based on the primary keys during the synchronization.
   Note In the following cases, you must configure the primary keys:
   - You use an incremental data synchronization method to synchronize data.
   - You use a full data synchronization method to synchronize data and set Write Policy to Update.
   For more information about synchronization methods, see the synchronization methods described in Step 7 in this topic.
3. The method used to create an index. Valid values:
   - Create Index: If you select this method, the name of the Elasticsearch index that is automatically created appears in the Elasticsearch Index Name column. You can click the name of the index to change the values of the parameters related to the index.
   - Use Existing Index: If you select this method, select the name of the desired index from the drop-down list in the Elasticsearch Index Name column. Then, you can click View Field Mapping to view the mappings between the source tables and destination Elasticsearch indexes.
If you select Create Index for Index creation method, you can click the Elasticsearch index name that appears in the Elasticsearch Index Name column to change the values of the parameters related to the destination Elasticsearch index based on your business requirements.

- Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to
the destination Elasticsearch indexes during synchronization. Valid values:
- true: If the system detects that the source tables contain new fields, the system synchronizes
the fields to the mapped destination Elasticsearch indexes, and the fields can be
searched in the indexes after synchronization. Default value: true.
- false: If the system detects that the source tables contain new fields, the system synchronizes
the fields to the mapped destination Elasticsearch indexes, but the fields cannot
be searched in the indexes after synchronization.
- strict: If the system detects that the source tables contain new fields, the system does
not synchronize the fields to the mapped destination Elasticsearch indexes, and an
error is reported. You can view the details of the error in the node logs.
For more information about dynamic mappings, see the description of the dynamic parameter for open source Elasticsearch.
- Shards and Replicas: the number of primary shards for the destination Elasticsearch index and the number
of replica shards for each primary shard. The shards are distributed on different
nodes in an Elasticsearch cluster to support distributed searches. This improves the
query efficiency of Elasticsearch. For more information, see Terms.
Note The values of Shards and Replicas cannot be changed after you specify them and the synchronization solution starts to run. The default values of Shards and Replicas are 1. A sketch of the corresponding Elasticsearch index settings follows this list.
- Partition settings: You can use a column in a source table as a partition key column. This parameter
must be used together with the Shards and Replicas parameters. By default, the Enable Partitioning for Elasticsearch Indexes check box
is not selected.
- Data field structure: This section allows you to configure the types and extended attributes of the fields
in the mapped destination Elasticsearch indexes. For more information, see Field data types in open source Elasticsearch.
Note If you do not change the values of the parameters related to the destination Elasticsearch
indexes after the indexes are created, the system synchronizes data based on the default
values of the parameters.
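For reference, the sketch below shows the Elasticsearch index settings that Shards, Replicas, Dynamic Mapping Status, and the field types correspond to. DataWorks creates the index for you based on the console configuration; the index name, field names, and the use of the Elasticsearch 8.x Python client here are assumptions for illustration only.

```python
# Illustration only: the Elasticsearch settings behind Shards, Replicas,
# Dynamic Mapping Status, and the field types in the data field structure.
# The index name, fields, and client version are hypothetical assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # replace with your cluster endpoint

es.indices.create(
    index="orders_index",
    settings={
        "number_of_shards": 1,    # Shards: cannot be changed after the solution runs
        "number_of_replicas": 1,  # Replicas: replica shards per primary shard
    },
    mappings={
        "dynamic": "true",  # "true", "false", or "strict", as in Dynamic Mapping Status
        "properties": {
            "order_id": {"type": "long"},
            "region": {"type": "keyword"},
            "amount": {"type": "double"},
            "gmt_modified": {"type": "date"},
        },
    },
)
```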
- Click Next Step.
- Configure synchronization rules.
- In the Sync Rules step, select a synchronization method.

The following synchronization methods are supported:
- Only One-time Full Sync: If you use this method, you need only to perform synchronization operations once to synchronize all data in the source to Elasticsearch.
- Only One-time Incremental Sync: If you use this method, you need only to perform synchronization operations once to synchronize incremental data in the source to Elasticsearch based on the specified filter conditions.
- Periodic Full Sync: If you use this method, you must configure a scheduling cycle for the batch synchronization solution. Then, the system synchronizes all data in the source to Elasticsearch each time the system runs the solution based on the specified scheduling cycle.
- Periodic Incremental Sync: If you use this method, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
- Incremental Sync after One-time Full Sync: If you use this method, the system first synchronizes all data to Elasticsearch. Then, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
- Configure parameters for the selected synchronization method.
The parameters that you need to specify in the Full Sync, Incremental Sync, and Recurrence sections vary based on the synchronization method that you selected. The parameters are described below.
- Full Sync
  The parameters in this section are required only if you set Solution to Only One-time Full Sync, Periodic Full Sync, or Incremental Sync after One-time Full Sync.
  - Clear Index Data Before Writing: Valid values:
    - Yes: The original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes.
    - No: The original data in the destination Elasticsearch indexes is retained before data in the source is written to the indexes.
    Notice If you set this parameter to Yes, all the original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes. Exercise caution when you set this parameter.
  - Write Policy: Valid values:
    - Insert: The system inserts data into the destination Elasticsearch indexes during data synchronization. This is the default value of this parameter.
    - Update: If a document that has the same primary key value as a source record already exists in a destination Elasticsearch index, the system first deletes the document and then inserts the new data into the index. Otherwise, the system directly inserts the data into the destination Elasticsearch index.
  - Batch Size: The number of data records that can be written to Elasticsearch at a time. Default value: 1000. You can set this parameter to an appropriate value based on actual network conditions and the data volume that you want to synchronize. This can reduce network overheads.
  A sketch that relates these options to plain Elasticsearch operations follows this list.
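To relate these options to plain Elasticsearch operations, the sketch below shows a rough equivalent of clearing index data before writing, writing by document ID as the Update policy does, and sending records in batches. It is a simplified analogy of the behavior described above, not the synchronization code that DataWorks executes; the index name, fields, and the Elasticsearch 8.x Python client are assumptions.

```python
# Illustration only: rough Elasticsearch equivalents of the Full Sync options.
# The index name, fields, and client version are hypothetical assumptions.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # replace with your cluster endpoint

# Clear Index Data Before Writing = Yes: delete the existing documents first.
es.delete_by_query(index="orders_index", query={"match_all": {}})

rows = [{"order_id": i, "amount": i * 1.5} for i in range(5000)]

# Write Policy = Update: use the primary key as the document ID so that a
# document with the same key is replaced instead of duplicated.
actions = (
    {"_index": "orders_index", "_id": row["order_id"], "_source": row}
    for row in rows
)

# Batch Size: the number of records sent to Elasticsearch per request.
helpers.bulk(es, actions, chunk_size=1000)
```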
- Incremental Sync
  The parameters in this section are required only if you set Solution to Only One-time Incremental Sync, Periodic Incremental Sync, or Incremental Sync after One-time Full Sync.
  - Write Policy: Valid values:
    - Insert: The system inserts data into the destination Elasticsearch indexes during data synchronization. This is the default value of this parameter.
    - Update: If a document that has the same primary key value as a source record already exists in a destination Elasticsearch index, the system first deletes the document and then inserts the new data into the index. Otherwise, the system directly inserts the data into the destination Elasticsearch index.
  - Batch Size: The number of data records that can be written to Elasticsearch at a time. Default value: 1000. You can set this parameter to an appropriate value based on actual network conditions and the data volume that you want to synchronize. This can reduce network overheads.
  - Incremental Condition: The filter conditions that are used to filter data in the source so that only incremental data is synchronized. You can configure filter conditions based on the descriptions in Configure scheduling parameters. An illustrative sketch follows this list.
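As an example of what an incremental condition typically expresses, the filter usually compares a modification time column against the business date of the scheduling cycle so that each run picks up only the rows changed in that cycle. The sketch below only illustrates the idea with a hypothetical gmt_modified column and a date computed in Python; in the console, you would reference a scheduling parameter instead, as described in Configure scheduling parameters.

```python
# Illustration only: the kind of filter an incremental condition expresses.
# The gmt_modified column and the date math are hypothetical; in DataWorks
# you would use a scheduling parameter rather than Python code.
from datetime import date, timedelta

bizdate = (date.today() - timedelta(days=1)).strftime("%Y%m%d")  # previous day

# Rows modified on or after the business date are treated as incremental data.
incremental_condition = f"gmt_modified >= '{bizdate}'"
print(incremental_condition)  # for example: gmt_modified >= '20240101'
```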
- Recurrence
  - Recurrence: The scheduling cycle of the batch synchronization solution. Valid values: Minute, Hour, Daily, Weekly, and Monthly. For more information about how to configure a scheduling cycle, see Configure a scheduling cycle.
  - Scheduling Period: The batch synchronization solution is run only within the scheduling period that you specified.
  - Pausing Scheduling: If you select Pausing Scheduling, the batch synchronization solution is paused. In this case, the solution does not run based on the scheduling cycle until you cancel the pausing. You can select this check box if you do not need to run the solution for a period of time.
  - Rerun: Valid values:
- Click Next Step.
- Configure the resources required for the synchronization solution.
In the Set Resources for Solution Running step, configure the parameters.

- Full Sync
  The parameters in this section are required only if you set Solution to Only One-time Full Sync, Periodic Full Sync, or Incremental Sync after One-time Full Sync in the Sync Rules step.
  - Offline task name rules: The name of the batch synchronization node that is used to synchronize the full data of the source. After the synchronization solution is created, DataWorks generates a batch synchronization node to synchronize the full data of the source.
  - Resource Group for Full Batch Sync Nodes: Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
    Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
- Incremental Sync
  The parameters in this section are required only if you set Solution to Only One-time Incremental Sync, Periodic Incremental Sync, or Incremental Sync after One-time Full Sync in the Sync Rules step.
  - Naming Rule for Incremental Sync Nodes: The name of the batch synchronization node that is used to synchronize the incremental data of the source. After the synchronization solution is created, DataWorks generates a batch synchronization node to synchronize the incremental data of the source.
  - Resource Group for Incremental Batch Sync Nodes: Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
    Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
- Scheduling Settings
  - Select scheduling Resource Group: The resource group for scheduling that is used to run the nodes generated by the batch synchronization solution. Only exclusive resource groups for Data Integration can be used to run solutions. You can set this parameter to the name of the exclusive resource group for Data Integration that you purchased. For more information, see Plan and configure resources.
    Note If you do not have an exclusive resource group, click Create a new exclusive Resource Group to create one.
  - Maximum number of connections supported by source read: The maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. Specify an appropriate number based on the resources of the source. Default value: 15.
- Click Complete configuration. The batch synchronization solution used to synchronize all data in a database is
created.
Run the batch synchronization solution
On the Tasks page, find the created data synchronization solution and click Submit and Run in the Operation column to run the solution.
View the statuses and results of the synchronization nodes
- On the Tasks page, find the solution that is run and click Execution details in the Operation column. Then, you can view the details of all nodes generated by
the batch synchronization solution.

- Find a node whose details you want to view and click Execution details in the Status column. In the dialog box that appears, click the provided link to
go to the DataStudio page.
Manage the batch synchronization solution
- View or edit the data synchronization solution.
On the Tasks page, find the newly created synchronization solution and choose the corresponding option in the Operation column. Then, you can view or modify the configurations of the batch synchronization solution.
Note You can edit a batch synchronization solution only if the solution is in the Not Running state. If you click Modify Configuration in the Operation column that corresponds to a batch synchronization solution in another state, you can only view information about the solution.
- Change the priority of the batch synchronization solution.
Find the newly created batch synchronization solution and choose the corresponding option in the Operation column. In the Change Priority dialog box, enter the desired priority and click Confirm. You can set the priority to an integer from 1 to 8. A larger value indicates a higher priority.
Note If multiple batch synchronization solutions have the same priority, the system runs them in the order in which they are committed.
- Delete the batch synchronization solution.
Find the batch synchronization solution that you want to delete and choose the corresponding option in the Operation column. In the Delete message, click OK.
Note After you click OK, only the configuration record of the batch synchronization solution is deleted. The synchronization nodes generated by the solution and the data tables generated by the synchronization nodes are not affected.