This topic describes the factors that affect the speed of data synchronization, and how to adjust the concurrency for sync nodes to maximize the synchronization speed. This topic also describes bandwidth throttling settings and scenarios of slow data synchronization.

Data Integration of DataWorks is a one-stop platform that supports real-time and batch data synchronization between data stores in all regions and all network environments. You can synchronize data between various types of cloud storage and local storage each day.

DataWorks provides excellent data transmission performance and supports data synchronization between more than 400 pairs of heterogeneous data stores. These features allow you to focus on the key issues on constructing big data solutions.

Factors that affect the speed of data synchronization

The following factors affect the speed of data synchronization:
  • Source data store
    • Database performance: the performance of the CPU, memory, solid-state drive (SSD), network, and hard disk.
    • Concurrency: A high concurrency results in a heavy database workload.
    • Network: the bandwidth (throughput) and speed of the network. Generally, a database that has better performance can support more concurrent nodes and a greater concurrency value can be set for sync nodes.
  • Configuration of sync nodes
    • Synchronization speed: whether an upper limit is set for the synchronization speed.
    • Concurrency: the maximum number of concurrent threads to read data from the source data store and write data to the destination data store.
    • Nodes that are waiting for resources.
    • Bandwidth throttling: The bandwidth of a single thread is 1,048,576 bit/s. Timeout occurs when the business is sensitive to the network speed. We recommend that you set a small bandwidth limit.
    • Whether an index is created for query statements.
  • Destination data store
    • Performance: the performance of the CPU, memory, SSD, network, and hard disk.
    • Load: Excessive load in the destination database affects the write efficiency within the sync nodes.
    • Network: the bandwidth (throughput) and speed of the network.

You must monitor and optimize the performance, load, and network of the source and destination databases. The following sections describe the optimal configuration of a sync node in Data Integration.

Concurrency

You can configure the concurrency for a sync node on the codeless user interface (UI). The following example shows how to configure the concurrency in the code editor:
"setting": {
      "speed": {
        "concurrent": 10
      }
    }   }

Bandwidth throttling

By default, bandwidth throttling is disabled. In a sync node, data is synchronized at the maximum transmission rate given the concurrency that is configured for the node. Excessively fast synchronization may overstress the database and thus affect the production. Therefore, Data Integration allows you to limit the synchronization speed and optimize the configuration as required. If bandwidth throttling is enabled, we recommend that you limit the maximum transmission rate to 30 Mbit/s. The following example shows how to configure an upper limit for the synchronization speed in the code editor, in which the transmission bandwidth is 1 Mbit/s:
"setting": {
      "speed": {
         "throttle": true // Specifies that bandwidth throttling is enabled.
        "mbps": 1, // The synchronization speed.
      }
    }
  • The valid values of the throttle parameter are true and false.
    • If you set the throttle parameter to true, bandwidth throttling is enabled. In this case, you must set the mbps parameter. If you do not set the mbps parameter, an error is returned when the sync node is run or data is synchronized at an abnormal speed.
    • If you set the throttle parameter to false, bandwidth throttling is disabled, and you do not need to set the mbps parameter.
  • The bandwidth value is a Data Integration metric and does not represent the actual network interface card (NIC) traffic. Generally, the NIC traffic is two to three times of the channel traffic. The actual NIC traffic depends on the serialization of the data storage system.
  • A semi-structured file does not have shard keys. If multiple files exist, you can set the maximum transmission rate of a node to increase the synchronization speed. However, the maximum transmission rate is limited by the number of files.
    Assume that the maximum transmission rate can be set to n Mbit/s for n files.
    • If you set the maximum transmission rate to (n + 1) Mbit/s, the files are still synchronized at a speed of n Mbit/s.
    • If you set the maximum transmission rate to (n - 1) Mbit/s, the files are synchronized at a speed of (n - 1) Mbit/s.
  • A table in a relational database can be split based on the maximum transmission rate only after you set the maximum transmission rate and shard key. In general, relational databases support only numeric-type shard keys. However, Oracle databases support numeric- and string-type shard keys.

Scenarios of slow data synchronization

  • Scenario 1: Sync nodes to be run on shared resources for scheduling remain waiting for resources.
    • Sample scenario

      When you test a node in DataWorks, the node remains waiting for resources and an internal system error occurs.

      For example, you use the default resource group to run a sync node to synchronize data from ApsaraDB Relational Database Service (RDS) to MaxCompute. The node has waited about 800 seconds before it is run. However, the log shows that the node runs for only 18 seconds. When you run other sync nodes, they also remain in the waiting state.

      The following log is displayed:
      2017-01-03 07:16:54 : State: 2(WAIT) | Total: 0R 0B | Speed: 0R/s 0B/s | Error: 0R 0B | Stage: 0.0%
    • Solution

      The shared resources for scheduling that you use are not exclusively used by a single user. Many nodes, not only two or three nodes of a single user, are run on the shared resources for scheduling. If resources are insufficient after you start to run a node, the node must wait for resources. In this case, the node is delayed for 800 seconds, and it takes only 10 seconds for the node to be run.

      To improve the synchronization speed and reduce the waiting time, we recommend that you run sync nodes during off-peak hours. Generally, most sync nodes are run between 00:00 and 03:00. You can avoid this time period to prevent your nodes from waiting for resources.

  • Scenario 2: Accelerate nodes that synchronize data from multiple source tables to the same destination table.
    • Sample scenario

      Sync nodes are configured to run in sequence to synchronize data from tables of multiple data stores to the same destination table. However, the synchronization takes a long time.

    • Solution
      To start multiple concurrent nodes that write data to the same destination database, take note of the following points:
      • Make sure that the destination database can support the running of all the concurrent nodes.
      • You can configure a sync node that synchronizes data from multiple source tables to the same destination table. Alternatively, you can configure multiple nodes to concurrently run in the same workflow.
      • If nodes need to wait for resources when they are run, you can configure them to run during off-peak hours. This ensures that the nodes have a higher execution priority.
  • Scenario 3: A full table scan slows down the data synchronization because no index is added in the WHERE clause.
    • Sample scenario
      The following SQL statement is executed:
      select bid,inviter,uid,createTime from `relatives` where createTime>='2016-10-2300:00:00'and reateTime<'2016-10-24 00:00:00';

      The sync node started to run at 11:01:24.875 on October 25, 2016 and started to return results at 11:11:05.489 on October 25, 2016. The synchronization program is waiting for the database to return SQL query results. However, it takes a long time before MaxCompute can respond.

    • Cause

      When the WHERE clause is used for a query, the createTime column is not indexed, resulting in a full table scan.

    • Solution

      If you use the WHERE clause, we recommend that you use an indexed column or add an index to the column that you want to scan. This can improve performance.