Optimize the performance of batch synchronization nodes - DataWorks

This topic describes the factors that affect the speed of data synchronization and how to adjust the parallelism for synchronization nodes to maximize the synchronization speed. This topic also describes bandwidth throttling settings and slow data synchronization scenarios.

Overview

The speed of data synchronization is affected by many factors such as the synchronization node configurations, database performance, and network. For more information, see Factors that affect the speed of data synchronization.
Data synchronization may be slow in all execution phases of a synchronization node. This topic describes the possible problems in the execution phases and the solutions to these problems. For more information, see Scenarios and solutions to slow data synchronization.
If the data synchronization performance of a database is limited, faster data synchronization is not necessarily good for the database and may overstress the database and affect the production. Data Integration provides the Bandwidth Throttling parameter, which can be used to control the data synchronization speed. You can configure this parameter based on your business requirements. For more information, see Limit the synchronization speed.

Factors that affect the speed of data synchronization

The speed of data synchronization is affected by many factors such as the source database, destination database, and synchronization node configurations. The factors that you can adjust and optimize are the source database performance, destination database performance, database loads, and network.

The following table describes the factors that affect the speed of data synchronization.


Factor	Description
Source	Database performance: the performance of the CPU, memory, SSD, network, and hard disk. Parallelism: A high parallelism results in a heavy database load. A database that has better performance can support more parallel threads. In this case, you can specify a large parallelism value for a synchronization node. Network: the bandwidth (throughput) and speed of the network.
Resource group for scheduling used to run a synchronization node	A resource group for scheduling is used to issue a synchronization node to a resource group for Data Integration for running. The resource usage of the resource group for scheduling affects the overall synchronization efficiency of the synchronization node. For more information about the node issuing mechanism, see Mechanism for issuing nodes.
Configurations of a synchronization node	Synchronization speed: whether an upper limit is specified for the synchronization speed. Parallelism: the maximum number of parallel threads that are used to read data from the source or write data to the destination. Whether the current node is waiting for resources. Bandwidth throttling: The bandwidth of a single thread is 1,048,576 Byte/s. If your business is sensitive to network speed, timeout occurs. We recommend that you specify a small bandwidth limit. Whether an index is created for query statements.
Destination	Performance: the performance of the CPU, memory, SSD, network, and hard disk. Load: Excessive loads on the destination database affect the write efficiency of the synchronization node. Network: the bandwidth (throughput) and speed of the network.

Scenarios and solutions to slow data synchronization

Note For more information about logs of batch synchronization nodes, see Scenario: Analyze run logs generated for a batch synchronization node.


Scenario	Problem description	Possible cause	Solution
Wait for scheduling resources	Problem description 1: The logs of a synchronization node show that the node is waiting for gateway resources. Problem description 2: The General tab of the directed acyclic graph (DAG) of an auto triggered node instance generated for the node shows that the instance waits for resources for a long period of time.	A resource group for scheduling is used to issue the synchronization node to a compute engine instance for running. Therefore, if the number of synchronization nodes that are run in parallel on the resource group for scheduling reaches the upper limit, the current node must wait until the synchronization nodes finish running and the resources used by the nodes are released.	On the Intelligent Diagnosis page, you can view the information about the nodes that are using the resources of the resource group while the current node is waiting for resources. Note If you use the shared resource group for scheduling, we recommend that you migrate the node to an exclusive resource group for scheduling for running.
Wait for resources in a resource group for Data Integration	The logs of a synchronization node show that the node is in the WAIT state.	The remaining resources in the resource group for Data Integration that you want to use to run the current synchronization node are insufficient to run the node. For example, a resource group for Data Integration supports a maximum of eight parallel threads. Three synchronization nodes are configured to run on the resource group for Data Integration. Three parallel threads are configured for each of the synchronization nodes. If two of the nodes are run in parallel on the resource group, the resource group for Data Integration can support two more parallel threads. In this case, the remaining node has to wait for resources in the resource group due to insufficient resources, and the logs of the node show that the node is in the WAIT state.	You can use one of the following methods to resolve the issue that other synchronization nodes in the resource group for Data Integration are using a large number of resources: Note On the Intelligent Diagnosis page, you can view the resource usage and the information about the nodes that are using the resources of the resource group while the current node is waiting for resources. The maximum number of parallel threads supported by a resource group for Data Integration varies based on the specifications of the resource group for Data Integration. For more information, see Overview of an exclusive resource group for Data Integration. Check whether the nodes that are using the resources in the resource group for Data Integration are blocked or significantly slow down. If the nodes have any issues, you must resolve the issues or suspend the nodes. If the nodes are not blocked, wait until the nodes finish running and the resources used by the nodes are released, and then run the current node. Find the nodes that are using the resources in the resource group for Data Integration and ask the owners of the nodes to reduce the parallel threads for the nodes. Reduce the parallel threads that you specified for the current batch synchronization node. Then, commit and deploy the node again. Scale out the resource group for Data Integration. For more information, see Scale out or in a resource group.
Slow data synchronization speed	The logs of a synchronization node show that the node is in the RUN state but the data synchronization speed is 0. If the synchronization node is running but remains in such a state for an extended period of time, we recommend that you click the link to the right of Detail log url to view the execution details of the node. If the value of the WaitReaderTime parameter is large in the node logs, it takes a long time to wait for data to be returned from the source.	An inappropriate shard key is specified. The SQL statements that are generated based on the shard key to read data from the source database are executed for an extended period of time. The SQL statements that are used to read data from the source database take an extended period of time to execute. For some Reader plug-ins, you need to configure the parameters such as where and querySql. The parameters slow down the execution of the SQL statements. Scenario: A full table scan slows down the data synchronization because no index is added in the WHERE clause. Database loads are high when the synchronization node is run. Network issues such as the bandwidth (throughput) and network speed issues exist. Note If you synchronize data over the Internet, the data synchronization speed cannot be ensured.	Resolve the issue of the long execution time of SQL statements: Take note of the following items when you configure the SQL statements that you want to execute before or after synchronization: Make sure that an index is added for the fields that you use for data filtering. This prevents the synchronization node from scanning a full table. Prevent or reduce complex processing such as function-based processing. If complex processing is necessary, try to perform operations in the source or destination database before data synchronization. Check whether the volume of data in source tables is large. If it is, use multiple synchronization nodes to synchronize the data. Check the database logs to find the SQL statements that are blocked and contact the database administrator to resolve the issue. Check the loads on the source or destination database at the time when the synchronization node is run.
	The logs of a synchronization node show that the node is in the RUN state but the data synchronization speed is 0. In fact, the synchronization node is running. However, if you notice that the node remains in such a state for an extended period of time, we recommend that you click the link to the right of Detail log url to view the execution details of the node. If the value of the WaitWriterTime parameter is large in the node logs, it takes a long time to write data to the destination.	The SQL statements that you configure to execute before and after data is written to the destination database take an extended period of time to execute. For some Writer plug-ins, you need to configure the parameters such as preSql and postSql. The parameters slow down the execution of the SQL statements. Database loads are high when the synchronization node is run. Network issues such as the bandwidth (throughput) and network speed issues exist. Note If you synchronize data over the Internet, the data synchronization speed cannot be ensured.
	The logs of a synchronization node show that the node is in the RUN state but the data synchronization speed is slow.	An inappropriate shard key is specified for a synchronization node that is used to synchronize data from a relational database. In this case, the parallelism settings become invalid for the node. The node runs a single thread to synchronize data. The maximum number of parallel threads is set to a small value. A large amount of dirty data is generated during data synchronization, which affects the synchronization speed. Database performance issues exist. Note A database that has better performance can support more parallel threads. In this case, you can specify a large parallelism value for a synchronization node. Network issues such as the bandwidth (throughput) and network speed issues exist. Note If you synchronize data over the Internet, the data synchronization speed cannot be ensured.	Specify an appropriate shard key. For more information about how to specify a shard key for a synchronization node, see View run logs generated for a batch synchronization node. Specify the maximum number of parallel threads for each synchronization node that is run on a resource group for Data Integration based on the total number of parallel threads that are supported by the resource group. You can increase the number of parallel threads for the current synchronization node based on your business requirements. You can configure the parallelism for a synchronization node on the codeless UI. The following example shows how to configure the parallelism in the code editor: Note The maximum number of parallel threads supported by a resource group for Data Integration varies based on the specifications of the resource group. For more information, see Overview of an exclusive resource group for Data Integration. Process dirty data. For more information about the definition of dirty data, see Terms. If synchronization nodes are run in distributed execution mode, make sure that the value obtained after the number of parallel threads for the nodes is divided by the number of ECS instances in a resource group for Data Integration is less than or equal to the maximum number of parallel threads that are supported by a single ECS instance in the resource group for Data Integration. If you want to synchronize data across clouds or regions, we recommend that you synchronize data over an internal network after network connections are established. For information about network connectivity solutions, see Establish a network connection between a resource group and a data source. Check database loads.

Limit the synchronization speed

By default, bandwidth throttling is disabled. In a synchronization node, data is synchronized at the maximum transmission rate given the parallelism that is configured for the node. Excessively fast synchronization may overstress the database and thus affect the production. Therefore, Data Integration allows you to limit the synchronization speed and optimize the configuration based on your business requirements. If bandwidth throttling is enabled, we recommend that you limit the maximum transmission rate to 30 MB/s. The following example shows how to configure an upper limit for the synchronization speed in the code editor, in which the transmission rate is 1 MB/s:

"setting": {
      "speed": {
         "throttle": true // Indicates that bandwidth throttling is enabled. 
        "mbps": 1, // The synchronization speed. 
      }
    }

The valid values of the throttle parameter are true and false.
- If you set the throttle parameter to true, bandwidth throttling is enabled. In this case, you must configure the mbps parameter. If you do not configure the mbps parameter, an error is returned when the synchronization node is run or data is synchronized at an abnormal speed.
- If you set the throttle parameter to false, bandwidth throttling is disabled, and you do not need to configure the mbps parameter.
The bandwidth value is a Data Integration metric and does not represent the actual elastic network interface (ENI) traffic. In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic varies based on the serialization of the data storage system.
A semi-structured file does not have shard keys. If multiple files exist, you can configure the maximum transmission rate for a node to increase the synchronization speed. However, the maximum transmission rate is limited by the number of files.
For example, the maximum transmission rate for n files can be set to n MB/s.
- If you set the maximum transmission rate to (n + 1) MB/s, the files are still synchronized at a speed of n MB/s.
- If you set the maximum transmission rate to (n - 1) MB/s, the files are synchronized at a speed of (n - 1) MB/s.
A table in a relational database can be split based on the maximum transmission rate only after you specify the maximum transmission rate and shard key. In most cases, relational databases support only numeric-type shard keys. However, Oracle databases support numeric-type and string-type shard keys.

FAQ

What do I do if a full scan for a MaxCompute table slows down data synchronization because no index is added in the WHERE clause?.
The BatchSize or maxfilesize parameter determines the number of data records to synchronize at a time. We recommend that you specify an appropriate parameter value to reduce the interactions between Data Integration and databases and increase throughput. If you set the BatchSize or maxfilesize parameter to an excessively large value, an out of memory (OOM) error may occur during data synchronization. To resolve the issue, see Batch synchronization.

Appendix: Check the actual parallelism

On the log details page of a batch synchronization node, find the entry in the JobContainer - Job set Channel-Number to X channels format. In this example, the value of X is 2, which indicates that two parallel threads are used by the batch synchronization node. Check the actual parallelism

Appendix: Relationship between the parallelism and resource usage

This section describes the relationships between the parallelism and CPU utilization and between the parallelism and memory usage of exclusive resource groups.

Relationship between the parallelism and CPU utilization
For exclusive resource groups, the ratio of the parallelism to CPU utilization is 1:0.5. For example, if the exclusive resource group that you purchase uses an ECS instance with the specifications of 4 vCPUs and 8 GiB of memory, the parallelism of the exclusive resource group is 8. You can run a maximum of eight batch synchronization nodes with a parallelism of 1 or four batch synchronization nodes with a parallelism of 2 at a time.
If the node that you want to run on the exclusive resource group requires more threads than the threads available in the exclusive resource group, the node needs to wait until the nodes that are using the resources of the resource group finish running and sufficient threads are available for the node.
Note If the node that you want to run on the exclusive resource group requires more threads than the maximum number of threads that can be provided by the exclusive resource group, the node enters the state of waiting for resources. For example, if you want to run a node that requires 10 parallel threads on the exclusive resource group that uses an ECS instance with the specifications of 4 vCPUs and 8 GiB of memory, the node will permanently wait for resources. The exclusive resource group allocates resources to nodes based on the sequence in which the nodes are committed. Therefore, DataWorks cannot run nodes that are committed later than this node.
Relationship between the parallelism and memory usage
For exclusive resource groups, the minimum memory size that can be allocated to a synchronization node is calculated by using the following formula: 768 MB + (Parallelism - 1) × 256 MB. The maximum memory size that can be allocated to a synchronization node is 8,029 MB. If you specify a memory size when you configure the synchronization node, the specified memory size overrides the default settings of the exclusive resource group. If you configure the synchronization node in the code editor, you can specify a memory size for the node by configuring the jvmOption parameter in the $.setting.jvmOption path in the JSON configurations.
To ensure that all nodes that are running on an exclusive resource group can be run as expected, we recommend that you set the total memory size of the nodes to at least 1 GB less than the total memory size of all ECS instances that are deployed in the exclusive resource group. If this condition is not met, the Linux OOM killer forcefully stops the nodes that are running.
Note If you do not configure the required memory size in the code editor, you need to consider only the parallelism limits when you commit synchronization nodes.

Appendix: Synchronization speed

The read and write speeds vary based on data sources. This section describes the average speed for a single thread to read data from or write data to each type of data source.

Average speed for a single thread to write data to each type of data source


Writer	Average write speed for a single thread (KB/s)
AnalyticDB for PostgreSQL	147.8
AnalyticDB for MySQL	181.3
ClickHouse	5259.3
DataHub	45.8
DRDS	93.1
Elasticsearch	74.0
FTP	565.6
GDB	17.1
HBase	2395.0
HBase20xsql	37.8
HDFS	1301.3
Hive	1960.4
HybridDB for MySQL	323.0
HybridDB for PostgreSQL	116.0
Kafka	0.9
LogHub	788.5
MongoDB	51.6
MySQL	54.9
ODPS	660.6
Oracle	66.7
OSS	3718.4
OTS	138.5
PolarDB	45.6
PostgreSQL	168.4
Redis	7846.7
SQL Server	8.3
Stream	116.1
TSDB	2.3
Vertica	272.0

Average speed for a single thread to read data from each type of data source


Reader	Average read speed for a single thread (KB/s)
AnalyticDB for PostgreSQL	220.3
AnalyticDB for MySQL	248.6
DRDS	146.4
Elasticsearch	215.8
FTP	279.4
HBase	1605.6
HBase20xsql	465.3
HDFS	2202.9
Hologres	741.0
HybridDB for MySQL	111.3
HybridDB for PostgreSQL	496.9
Kafka	3117.2
LogHub	1014.1
MongoDB	361.3
MySQL	459.5
ODPS	207.2
Oracle	133.5
OSS	665.3
OTS	229.3
OTSStream	661.7
PolarDB	238.2
PostgreSQL	165.6
RDBMS	845.6
SQL Server	143.7
Stream	85.0
Vertica	454.3