Data Integration is an end-to-end data synchronization platform. With Data Integration, data can be synchronized in real time or on demand between any data sources and any network environment. You can replicate more than 10 TB of data across various types of cloud storage and local storage.
Data Integration enables rapid data transmission and data synchronization between more than 400 pairs of disparate data sources. This helps to build advanced analytics solutions and gain insights from the data.
This article provides information about the factors that affect data synchronization speed, methods to maximize synchronization speed by configuring data migration units (DMUs) and concurrency for synchronization tasks, throttling effects, and precautions for using a custom resource group.
The following factors can cause slow data synchronization:
Source database performance: CPU, memory, solid-state drive (SSD), network, and hard disk.
Concurrency: High concurrency results in high database load.
Network: Bandwidth (throughput) and speed.
A database with better performance can usually run with higher concurrency. Therefore, if the source database performance is good, you can run a synchronization task at high concurrency to read data from the database.
Synchronization task configurations
Synchronization speed: Whether a limit is set for the synchronization speed.
DMU: The amount of resources used for running the synchronization task.
Concurrency: The maximum number of threads that can read data from the data source, or write data to the target data source, at the same time in one synchronization task.
Tasks waiting for resources (in the WAIT status).
Bytes setting: If Bytes is set to 1048576 and the network is slow, data transmission can time out before it completes. We recommend that you set Bytes to a lower value.
Whether queries are performed on indexed columns.
Performance: The performance of the CPU, memory module, SSD, network, and hard disk.
Extremely high load for the target database.
Network: The bandwidth (throughput) and speed of the network.
Synchronization task configurations are explained further in this article.
Target database performance: CPU, memory, SSD, network, and hard disk.
Load: High database load affects data write efficiency.
Network: Bandwidth (throughput) and speed.
You must pay attention to and optimize networking, and the performance and load of the data source and target data source. Subsequent sections describe the core configurations of a synchronization task.
DMU is used to measure the amount of resources, including CPU, memory, and network used for data integration. One DMU represents the minimum amount of resources used for a data synchronization task.
A data synchronization task can run on one or more DMUs. In Wizard mode, you can configure a maximum of 20 DMUs for a task.
The following is an example of how to set the number of DMUs in Script mode.
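As a minimal sketch, assuming that Script mode keeps the DMU count under a `setting.speed.dmu` key (these key names are assumptions and are not confirmed by this article), the fragment can be generated and printed as JSON like this:

```python
import json


def dmu_setting(dmu: int) -> dict:
    """Build a Script-mode style fragment that requests `dmu` DMUs.

    The key names "setting", "speed" and "dmu" are assumptions; compare
    them with the script that Data Integration generates for your task.
    """
    if dmu < 1:
        raise ValueError("a task needs at least one DMU")
    return {"setting": {"speed": {"dmu": dmu}}}


# Print the fragment so that it can be compared with a real task script.
print(json.dumps(dmu_setting(10), indent=2))
```

The helper rejects values below 1 because one DMU is the minimum amount of resources a task can use.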
Note: If system performance is good, you can set the number of DMUs to more than 20 in Script mode. However, this may not improve system performance. Do not assign too many DMUs to a task.
If more DMUs are configured for a synchronization task, more resources are assigned to the task. However, this may not accelerate synchronization. To improve synchronization performance, you must configure both the concurrency and the number of DMUs to be used.
For example, if three threads are used to run one synchronization task at the same time, three DMUs are required. In this case, the synchronization speed is 10 Mbps, and it is sufficient for the three DMUs. Therefore, adding more DMUs will not accelerate synchronization.
Concurrency indicates the maximum number of threads that can be used to read data from the data source or write data to the target data source at the same time in one synchronization task.
In Wizard mode, you can configure concurrency for a task in the UI. In Script mode, you can use the following script to configure the concurrency.
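A minimal sketch of such a fragment, assuming that the concurrency is stored under a `setting.speed.concurrent` key next to the DMU count (the key names and the sample `job` dictionary are assumptions made for illustration):

```python
import copy
import json


def with_concurrency(job: dict, concurrent: int) -> dict:
    """Return a copy of `job` with the concurrency set under setting.speed."""
    if concurrent < 1:
        raise ValueError("concurrency must be at least 1")
    updated = copy.deepcopy(job)  # leave the caller's configuration untouched
    speed = updated.setdefault("setting", {}).setdefault("speed", {})
    speed["concurrent"] = concurrent  # assumed key name
    return updated


job = {"setting": {"speed": {"dmu": 3}}}  # hypothetical existing fragment
print(json.dumps(with_concurrency(job, 3), indent=2))
```

Working on a copy keeps the original fragment available, which is convenient when you compare several concurrency values for the same task.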
A higher concurrency requires more DMUs. When network conditions and performance of data sources are quite satisfactory, more DMUs and higher concurrency lead to better synchronization speed.
You must pay attention to the following aspects.
To make sure that a task is successfully executed at high concurrency in Wizard mode, the highest concurrency allowed must not exceed the number of DMUs you set. For example, do not configure more than 10 concurrent threads when the number of DMUs is set to 10.
Excessively high concurrency may affect the performance of the source database. When you set a high concurrency, make sure that the data sources perform well at that concurrency.
In Script mode, you can set a high concurrency. However, the number of DMUs that can be provided for a task is limited. Do not set an excessively high concurrency.
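The rules above can be expressed as a small pre-flight check. This is a sketch only: the function name and the warning texts are hypothetical, and the limit of 20 is the Wizard mode maximum stated earlier in this article.

```python
from typing import List

WIZARD_MAX_DMU = 20  # Wizard mode limit stated in this article


def check_speed_settings(dmu: int, concurrent: int) -> List[str]:
    """Return warnings for a DMU/concurrency pair; an empty list means no issue found."""
    warnings = []
    if dmu < 1 or concurrent < 1:
        warnings.append("dmu and concurrent must both be at least 1")
    if concurrent > dmu:
        warnings.append(
            f"concurrency {concurrent} exceeds the {dmu} DMUs configured"
        )
    if dmu > WIZARD_MAX_DMU:
        warnings.append(
            f"{dmu} DMUs is above the Wizard mode limit of {WIZARD_MAX_DMU}"
        )
    return warnings


for warning in check_speed_settings(dmu=10, concurrent=12):
    print(warning)
```

The check cannot tell whether the source database copes with the chosen concurrency; that still has to be verified against the database itself.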
After the beta phase of Data Integration has ended, throttling is disabled by default. In a synchronization task, data is synchronized at the maximum speed supported by the concurrency and DMUs configured for that task.
Excessively high speed may cause extremely high database load and affect data reading performance. Therefore, Data Integration provides throttling. You can determine whether to enable or disable throttling depending on database performance and network conditions. If throttling is enabled, we recommend that you limit the speed to 30 Mbps.
In Script mode, use the following script to enable or disable throttling.
"throttle": true, // Throttling enabled
"mbps": 1 // Synchronization speed
Note: When the throttle parameter is set to false, throttling is disabled, and you do not need to configure the mbps parameter.
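The relationship between the two parameters can be sketched as follows. The `throttle` and `mbps` names come from the script above; the helper name is hypothetical, and defaulting to 30 when no limit is given simply applies the recommendation made in this section.

```python
import json
from typing import Optional

RECOMMENDED_MBPS = 30  # limit recommended in this article when throttling is on


def throttle_setting(enabled: bool, mbps: Optional[int] = None) -> dict:
    """Build the throttling pair; mbps is omitted when throttling is disabled."""
    if not enabled:
        return {"throttle": False}
    limit = RECOMMENDED_MBPS if mbps is None else mbps
    if limit <= 0:
        raise ValueError("mbps must be a positive number")
    return {"throttle": True, "mbps": limit}


print(json.dumps(throttle_setting(True)))
print(json.dumps(throttle_setting(False)))
```

Where this pair sits inside the full task script is not shown in this article, so the sketch returns only the two parameters.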
A custom resource group refers to a local network of servers used to run synchronization tasks even when the network is unavailable.
You must configure the number of DMUs required for a task, concurrency, and throttling, regardless of whether you run the task on your own servers or on the servers provided by Alibaba Cloud. However, when a custom resource group is used, Data Integration does not bill you for the DMUs used. Instead, it bills you for the time consumed by the task for execution. For more information, see Billing.
You must make sure that each server has at least a 2 GHz processor with four cores, 8 GB of RAM, and 80 GB of disk space, for a task to run properly on a custom resource group.
When you test synchronization tasks in DataWorks, you may find that one or more synchronization tasks are waiting for resources and that internal system errors are reported.
For example, a synchronization task takes 800 seconds to complete, although the log shows that the task itself runs for only 18 seconds. The default resource group is used. Other running synchronization tasks, which include hundreds of entries of data synchronized from RDS to MaxCompute, wait for the resources.
The log is shown as follows.
2017-01-03 07:16:54 : State: 2(WAIT) | Total: 0R 0B | Speed: 0R/s 0B/s | Error: 0R 0B | Stage: 0.0%
Shared resources are used by many projects. If shared resources are insufficient when a task is running, the task waits for resources. In this case, the task execution time increases from 18 seconds to 800 seconds.
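To spot this situation in a log, the status line can be parsed for the WAIT state. The sketch below assumes that every status line follows the layout of the single log line shown above; the regular expression and the function name are illustrative, not part of Data Integration.

```python
import re

# Pattern derived from the one log line shown in this article (an assumption).
STATUS_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) : "
    r"State: (?P<code>\d+)\((?P<state>\w+)\) \| .*Stage: (?P<stage>[\d.]+)%"
)


def parse_status(line: str):
    """Return the time, state and progress of a status line, or None if it does not match."""
    match = STATUS_LINE.match(line)
    if match is None:
        return None
    return {
        "time": match.group("time"),
        "state": match.group("state"),
        "stage": float(match.group("stage")),
    }


line = (
    "2017-01-03 07:16:54 : State: 2(WAIT) | Total: 0R 0B | "
    "Speed: 0R/s 0B/s | Error: 0R 0B | Stage: 0.0%"
)
status = parse_status(line)
print(status["state"], status["stage"])

# Time not spent running in the example: 800 s in total, 18 s of actual work.
waiting_seconds = 800 - 18
print(waiting_seconds)
```

In the example, 782 of the 800 seconds are spent outside the actual run, which is why moving the task to off-peak hours has a much larger effect than tuning the task itself.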
If fast synchronization is required, we recommend that you run synchronization tasks during off-peak hours. A large number of synchronization tasks usually run from 00:00 to 03:00 (UTC+8). Tasks usually do not need to wait for resources during other periods.
Data from multiple data sources must be imported into a table, and synchronization tasks are set to run in a sequence. Synchronization takes a long time to process.
You can start multiple tasks at the same time. Pay attention to the following aspects during task execution.
Make sure that the target database has sufficient capacity.
When you configure your workflow, you can create a single node, and data from multiple data sources is synchronized to the target table at the same time. This method can be used only when the data source is a MySQL database. Alternatively, you can create multiple nodes in your workflow to run synchronization tasks at the same time. On each node, data from one data source is synchronized to the target table.
If tasks are waiting for resources, run the tasks during off-peak hours so that the tasks can run with a higher execution priority.
The executed SQL statement is as follows.
select bid,inviter,uid,createTime from `relatives` where createTime>='2016-10-23 00:00:00' and createTime<'2016-10-24 00:00:00';
Query statement execution started at 2016-10-25 11:01:24.875 (UTC+8). Query result return started at 2016-10-25 11:11:05.489 (UTC+8). The synchronization program waits for the database to return the SQL query result. It takes a long time before data can be written into MaxCompute.
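The delay can be computed directly from the two timestamps above. This short calculation uses only the values given in this article:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps taken from the example above (both UTC+8).
query_started = datetime.strptime("2016-10-25 11:01:24.875", FMT)
result_started = datetime.strptime("2016-10-25 11:11:05.489", FMT)

wait = result_started - query_started
print(f"{wait.total_seconds():.3f} s")  # prints "580.614 s"
```

The synchronization program therefore waits almost 10 minutes for the database before any data can be written.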
The WHERE clause filters on the createTime column, which has no index, so the query results in a full table scan.
We recommend that you add an index to each column that you filter on in the WHERE clause.
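The effect of an index on this kind of range filter can be reproduced locally. The sketch below uses an in-memory SQLite database as a stand-in for the source database, so the plan output differs from what MySQL or RDS reports; the column types and the index name are assumptions. On MySQL, the corresponding steps would be `CREATE INDEX` on the createTime column and `EXPLAIN` on the query.

```python
import sqlite3

QUERY = (
    "SELECT bid, inviter, uid, createTime FROM relatives "
    "WHERE createTime >= '2016-10-23 00:00:00' "
    "AND createTime < '2016-10-24 00:00:00'"
)


def query_plan(conn: sqlite3.Connection) -> str:
    """Return SQLite's plan for QUERY as one string (detail column only)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Column types are assumed; the article only lists the column names.
conn.execute(
    "CREATE TABLE relatives "
    "(bid INTEGER, inviter INTEGER, uid INTEGER, createTime TEXT)"
)

plan_before = query_plan(conn)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_relatives_createTime ON relatives (createTime)")
plan_after = query_plan(conn)   # the range filter can now use the index

print(plan_before)
print(plan_after)
conn.close()
```

Before the index exists, the plan reports a scan of the table; afterwards it reports a search that uses the new index, which is the change that shortens the wait on the source database.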