This topic describes the factors that affect the speed of data synchronization, and how to adjust the concurrency of sync nodes to maximize the synchronization speed. This topic also describes bandwidth throttling settings, scenarios of slow data synchronization, and how to deal with slow data synchronization.

Data Integration is a one-stop platform that supports real-time and batch data synchronization between data stores in any location and in any network environment. You can synchronize data between various types of cloud storage and local storage each day.

DataWorks provides excellent data transmission performance and supports data exchanges between more than 400 pairs of heterogeneous data stores. These features allow you to focus on the key issues in constructing big data solutions.

Factors affecting the speed of data synchronization

The factors that affect the speed of data synchronization are listed as follows:
  • Source
    • Database performance: the performance of the CPU, memory, solid-state drive (SSD), network, and hard disk.
    • Concurrency: A high concurrency results in a heavy database workload. Generally, a database with higher performance can support more concurrent nodes, so a larger concurrency value can be set for sync nodes.
    • Network: the bandwidth (throughput) and speed of the network.
  • Sync node
    • Synchronization speed: whether an upper limit is set for the synchronization speed.
    • Concurrency: the maximum number of concurrent threads that read data from the source and write data to the destination within the sync node.
    • Resource wait: whether the node is waiting for resources.
    • Bandwidth throttling: The bandwidth of a single thread is 1,048,576 bit/s. If your business is sensitive to the network speed, timeouts may occur. In this case, we recommend that you set a small bandwidth limit.
    • Index: whether the columns used in query statements are indexed.
  • Destination
    • Performance: the performance of the CPU, memory, SSD, network, and hard disk.
    • Load: Excessive load in the destination database affects the write efficiency within the sync nodes.
    • Network: the bandwidth (throughput) and speed of the network.

You need to monitor and optimize the performance, load, and network of the source and destination databases. The following sections describe the optimal settings of a sync node.

Concurrency

You can configure the concurrency for a node on the codeless user interface (UI). The following example shows how to configure the concurrency in the code editor:
"setting": {
      "speed": {
        "concurrent": 10
      }
    }

Bandwidth throttling

By default, bandwidth throttling is disabled. In this case, a sync node synchronizes data at the maximum transmission rate that the configured concurrency allows. Because excessively fast synchronization may overstress the database and affect production, Data Integration allows you to limit the synchronization speed and tune the limit as required. If bandwidth throttling is enabled, we recommend that you limit the maximum transmission rate to 30 Mbit/s. The following example shows how to configure an upper limit for the synchronization speed in the code editor. In this example, the transmission bandwidth is 1 Mbit/s:
"setting": {
      "speed": {
        "throttle": true, // Specifies that bandwidth throttling is enabled.
        "mbps": 1 // The synchronization speed.
      }
    }
Note
  • When the throttle parameter is set to false, bandwidth throttling is disabled, and you do not need to configure the mbps parameter.
  • The bandwidth value is a Data Integration metric and does not represent the actual network interface card (NIC) traffic. Generally, the NIC traffic is two to three times the channel traffic, depending on how the data storage system serializes data.
  • A semi-structured file does not have shard keys. If multiple files exist, you can set the maximum transmission rate of a node to increase the synchronization speed. However, the maximum transmission rate is limited by the number of files.
    Assume that the maximum transmission rate can be set to n Mbit/s for n files.
    • If you set the maximum transmission rate to (n+1) Mbit/s, the files are still synchronized at a speed of n Mbit/s.
    • If you set the maximum transmission rate to (n-1) Mbit/s, the files are synchronized at a speed of (n-1) Mbit/s.
  • A table in a relational database can be split based on the maximum transmission rate only after you set the maximum transmission rate and shard key. Usually, relational databases support only numeric-type shard keys. However, Oracle databases support numeric- and string-type shard keys.
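
The notes above can be sketched in Python as follows. This is a minimal illustration, not part of Data Integration: the function names build_speed_setting and effective_file_rate are hypothetical, combining concurrent, throttle, and mbps in one speed object is an assumption, and the cap of 1 Mbit/s per file follows the "n Mbit/s for n files" assumption stated in the note.

```python
import json


def build_speed_setting(concurrent=None, throttle=False, mbps=None):
    """Build the "setting" object of a sync node as a plain dict.

    Hypothetical helper: it only mirrors the JSON fragments shown in this topic.
    """
    speed = {"throttle": throttle}
    if concurrent is not None:
        speed["concurrent"] = concurrent
    if throttle:
        if mbps is None:
            raise ValueError("mbps is required when throttle is enabled")
        speed["mbps"] = mbps
    # When throttle is False, mbps is left out because it is not needed.
    return {"setting": {"speed": speed}}


def effective_file_rate(requested_rate, file_count):
    """Rate reached for semi-structured files, assuming a cap of 1 Mbit/s per file."""
    return min(requested_rate, file_count)


# Serialize to valid JSON (no comments, no trailing commas).
print(json.dumps(build_speed_setting(throttle=True, mbps=1), indent=2))

n = 5  # number of files
print(effective_file_rate(n + 1, n))  # capped by the number of files: 5
print(effective_file_rate(n - 1, n))  # below the cap, so the limit applies: 4
```

Generating the fragment from a dict avoids hand-written JSON mistakes such as a missing or trailing comma.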

Scenarios of slow data synchronization

  • Scenario 1: Sync nodes that run on the default resource group remain waiting for resources.
    • Example

      When you test a sync node in DataWorks, the node remains waiting for resources and an internal system error occurs.

      For example, a sync node is configured to synchronize data from ApsaraDB for Relational Database Service (RDS) to MaxCompute. The node waits about 800 seconds before it starts to run. However, the log shows that the node runs for only 18 seconds and then stops. The sync node uses the default resource group. When you run other sync nodes, they also remain in the waiting state.

      The log is as follows:
      2017-01-03 07:16:54 : State: 2(WAIT) | Total: 0R 0B | Speed: 0R/s 0B/s | Error: 0R 0B | Stage: 0.0%
    • Handling method

      The default resource group is shared and is not exclusively used by a single user. Many nodes, not just two or three nodes of a single user, run on the default resource group. If resources are insufficient after you start to run a node, the node needs to wait for resources. In this example, the node waits 800 seconds for resources but takes only 18 seconds to run.

      To improve the synchronization speed and reduce the waiting time, we recommend that you run sync nodes during off-peak hours. Typically, most sync nodes are run between 00:00 and 03:00. You can avoid this time period to prevent your nodes from waiting for resources.

  • Scenario 2: Accelerate nodes that synchronize data from multiple source tables to the same destination table.
    • Example

      Multiple sync nodes are configured to run in sequence to synchronize data from tables of multiple data stores to the same destination table. However, the synchronization takes a long time.

    • Handling method
      To start multiple concurrent nodes that write data to the same destination database, pay attention to the following points:
      • Make sure that the destination database can support the running of all the concurrent nodes.
      • You can configure a sync node that synchronizes multiple source tables to the same destination table. Alternatively, you can configure multiple nodes to run concurrently in the same workflow.
      • If resources are insufficient, you can configure sync nodes to run during off-peak hours.
  • Scenario 3: A full table scan slows down the data synchronization because the column in the WHERE clause is not indexed.
    • Example
      SQL statement:
      select bid,inviter,uid,createTime from `relatives` where createTime>='2016-10-23 00:00:00' and createTime<'2016-10-24 00:00:00';

      The sync node started to run at 11:01:24.875 on October 25, 2016 and started to return results at 11:11:05.489 on October 25, 2016. During this period, the synchronization program was waiting for the database to return the SQL query results. However, it takes a long time before MaxCompute can respond.

    • Cause

      The createTime column that is used in the WHERE clause is not indexed, so the query results in a full table scan.

    • Handling method

      We recommend that you use an indexed column in the WHERE clause, or add an index to the column that the WHERE clause filters on.
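
The effect described in Scenario 3 can be reproduced on a small scale. The following sketch uses an in-memory SQLite database as a stand-in for the source database, which is an assumption: the actual source engine, the column types, and the index name idx_relatives_createTime are not given in this topic, and the plan output of other databases looks different. It compares the query plan before and after an index is added to createTime.

```python
import sqlite3

# In-memory SQLite database used only for illustration (assumed column types).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE relatives "
    "(bid INTEGER, inviter INTEGER, uid INTEGER, createTime TEXT)"
)

QUERY = (
    "SELECT bid, inviter, uid, createTime FROM relatives "
    "WHERE createTime >= '2016-10-23 00:00:00' "
    "AND createTime < '2016-10-24 00:00:00'"
)


def query_plan(connection, sql):
    """Return the plan descriptions that SQLite reports for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


# Without an index, the range filter forces a scan of the whole table.
plan_without_index = query_plan(conn, QUERY)

# Hypothetical index name; any index on createTime serves the same purpose.
conn.execute("CREATE INDEX idx_relatives_createTime ON relatives (createTime)")
plan_with_index = query_plan(conn, QUERY)

print(plan_without_index)
print(plan_with_index)
```

The first plan reports a table scan, and the second reports a search that uses the index. On the actual source database, the equivalent check is to run its own EXPLAIN statement on the query that the sync node issues.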