This topic describes how to configure the concurrency of batch sync nodes, how the concurrency relates to resource usage, and the average synchronization speed for each type of data store.

Methods for configuring the concurrency

To adjust the resource usage and synchronization speed of a batch sync node, you can configure the concurrency of the batch sync node by using one of the following methods:
  • Configure the concurrency of the batch sync node by using the codeless user interface (UI). For more information, see Create a sync node by using the codeless UI.
    On the configuration tab of the batch sync node, set the Maximum number of concurrent tasks expected parameter in the Channel control section.
  • Configure the concurrency of the batch sync node by using the code editor. For more information, see Create a sync node by using the code editor.
    Set the $.setting.speed.concurrent parameter in the JSON code.
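The code editor setting can be sketched as follows. This is a minimal illustration, not a complete node script: only the $.setting.speed.concurrent path comes from this topic, while the surrounding dictionary, the set_concurrency helper, and the sample values are hypothetical.

```python
import json

# Hypothetical fragment of a batch sync node script. Only the
# setting.speed.concurrent path is taken from this topic; a real script
# contains more fields (reader, writer, and so on) that are omitted here.
job_config = {
    "setting": {
        "speed": {
            "concurrent": 2,  # requested concurrency
        }
    }
}


def set_concurrency(config: dict, concurrent: int) -> dict:
    """Return a copy of config with $.setting.speed.concurrent set."""
    if concurrent < 1:
        raise ValueError("concurrency must be a positive integer")
    updated = json.loads(json.dumps(config))  # simple deep copy via JSON
    updated.setdefault("setting", {}).setdefault("speed", {})["concurrent"] = concurrent
    return updated


# Print the fragment with the concurrency raised to 4.
print(json.dumps(set_concurrency(job_config, 4), indent=2))
```

The helper returns a modified copy so that the original configuration stays unchanged, which makes it easier to compare the requested values before you paste the JSON into the code editor.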

The actual concurrency of the batch sync node may be smaller than the concurrency specified by using the codeless UI or code editor, depending on the purchased resources and the involved data stores.

To check the actual concurrency of the batch sync node, perform the following steps:
  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. In the top navigation bar, select the required region. Find the workspace where the batch sync node resides and click Operation Center in the Actions column.
  4. On the page that appears, choose Cycle Task Maintenance > Cycle Instance in the left-side navigation pane.
  5. Click the target instance. The directed acyclic graph (DAG) of the instance appears on the right. In the DAG, right-click the batch sync node whose concurrency you want to view and select View Runtime Log.
  6. On the page that appears, click the link next to Detail log url.
  7. On the log details page of the batch sync node, find the entry in the following format: JobContainer - Job set Channel-Number to 2 channels. In this example, the value 2 indicates the number of concurrent threads that the batch sync node actually uses.
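
If you save the detailed log as text, the last step can be automated. The following sketch assumes that the log entry matches the format quoted in step 7; the function name actual_concurrency and the sample log line are illustrative only.

```python
import re
from typing import Optional

# Pattern based on the log entry quoted in step 7.
CHANNEL_PATTERN = re.compile(r"Job set Channel-Number to (\d+) channels")


def actual_concurrency(log_text: str) -> Optional[int]:
    """Return the channel number from the first matching log line, or None."""
    for line in log_text.splitlines():
        match = CHANNEL_PATTERN.search(line)
        if match:
            return int(match.group(1))
    return None


# Sample entry copied from the example in this topic.
sample_log = "JobContainer - Job set Channel-Number to 2 channels."
print(actual_concurrency(sample_log))
```

The function returns None when no matching entry exists, so a caller can distinguish a missing entry from a valid concurrency value.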

Relationships between the concurrency and resource usage

This section describes how the concurrency relates to CPU usage and to memory usage for exclusive resource groups:
  • Relationships between the concurrency and CPU usage

    For exclusive resource groups, each concurrent thread consumes 0.5 vCPU (a concurrency-to-CPU ratio of 1:0.5). Assume that the exclusive resource group that you purchase resides on an Elastic Compute Service (ECS) instance with 4 vCPUs and 8 GiB of memory. The exclusive resource group can then provide up to eight concurrent threads for running sync nodes (4 vCPUs ÷ 0.5 vCPU per thread). This means that you can run eight batch sync nodes with a concurrency of 1, or four batch sync nodes with a concurrency of 2, at the same time.

    If the node that you commit to the exclusive resource group requires more threads than are currently available in the resource group, the node waits until one or more running nodes stop and sufficient threads become available.
    Note: If the node that you commit to the exclusive resource group requires more threads than the maximum number of threads that the exclusive resource group can provide, the node can never run. For example, if you commit a node that requires 10 concurrent threads to an exclusive resource group that resides on an ECS instance with 4 vCPUs and 8 GiB of memory, the node waits for resources indefinitely. Because the exclusive resource group allocates resources to nodes in the order in which the nodes are committed, DataWorks also cannot run the nodes that are committed after this node.
  • Relationships between the concurrency and memory usage
    In an exclusive resource group, the minimum memory size that can be allocated to a sync node is calculated by using the following formula: 768 MB + (Concurrency - 1) × 256 MB. The maximum memory size that can be allocated to a sync node is 8,029 MB. However, if you specify the memory size required by a sync node when you configure the sync node, the specified memory size overrides the default settings of the exclusive resource group. When you configure a sync node by using the code editor, you can specify the memory size by setting the $.setting.speed.jvmOption parameter in the JSON code.
    To ensure that all nodes on an exclusive resource group run smoothly, the total memory size used by all running nodes must be at least 1 GB smaller than the total memory size of the ECS instance where the exclusive resource group resides. If this condition is not met, the Linux out-of-memory (OOM) killer forcibly stops running nodes.
    Note: If you do not modify the required memory size in the code editor, you need to consider only the concurrency limits when you commit sync nodes.
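
The CPU and memory rules above can be combined into a small capacity check. This is a rough sketch based only on the figures in this topic (0.5 vCPU per thread, 768 MB + (Concurrency - 1) × 256 MB, and 1 GB of reserved memory). The function names are hypothetical, and the sketch assumes that 1 GB equals 1,024 MB and that each node uses its default minimum memory size.

```python
from typing import List


def max_threads(vcpus: int) -> int:
    """Each concurrent thread uses 0.5 vCPU, so a group offers 2 threads per vCPU."""
    return vcpus * 2


def min_memory_mb(concurrency: int) -> int:
    """Default minimum memory for a sync node: 768 MB + (Concurrency - 1) x 256 MB."""
    return 768 + (concurrency - 1) * 256


def placement(requested: int, in_use: int, vcpus: int) -> str:
    """Classify a newly committed node by the thread rules described above."""
    capacity = max_threads(vcpus)
    if requested > capacity:
        return "never fits"  # exceeds the maximum the group can provide
    if requested > capacity - in_use:
        return "waits"       # must wait for running nodes to release threads
    return "runs"


def fits_memory(concurrencies: List[int], instance_memory_mb: int) -> bool:
    """Check that running nodes leave at least 1 GB (assumed 1,024 MB) free."""
    used = sum(min_memory_mb(c) for c in concurrencies)
    return used <= instance_memory_mb - 1024


# Example: the 4 vCPU, 8 GiB instance used in this topic.
print(max_threads(4))               # threads the group can provide
print(placement(10, 0, 4))          # node that requires 10 threads
print(fits_memory([1] * 8, 8192))   # eight nodes with a concurrency of 1
```

With the default memory settings, eight nodes with a concurrency of 1 need 8 × 768 MB = 6,144 MB, which stays within the 7,168 MB budget of an 8 GiB instance. This is consistent with the note that only the concurrency limit matters when the memory size is not modified.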

Synchronization speed

The read and write speeds vary depending on the involved data stores. The following tables list the average speed at which a single thread reads data from or writes data to each type of data store:
  • Average speed for a thread to write data to each type of data store
    Writer                       Average write speed (KB per second)
    AnalyticDB for PostgreSQL    147.8
    AnalyticDB for MySQL         181.3
    ClickHouse                   5259.3
    DataHub                      45.8
    DRDS                         93.1
    Elasticsearch                74.0
    FTP                          565.6
    GDB                          17.1
    HBase                        2395.0
    HBase11x                     0.2
    hbase20xsql                  37.8
    HDFS                         1301.3
    Hive                         1960.4
    Hologres                     19.1
    HybridDB for MySQL           323.0
    HybridDB for PostgreSQL      116.0
    Kafka                        0.9
    LogHub                       788.5
    MongoDB                      51.6
    MySQL                        54.9
    MaxCompute                   660.6
    Oracle                       66.7
    OSS                          3718.4
    Tablestore                   138.5
    PolarDB                      45.6
    PostgreSQL                   168.4
    Redis                        7846.7
    SQL Server                   8.3
    Stream                       116.1
    TSDB                         2.3
    Vertica                      272.0
  • Average speed for a thread to read data from each type of data store
    Reader                       Average read speed (KB per second)
    AnalyticDB for PostgreSQL    220.3
    AnalyticDB for MySQL         248.6
    DRDS                         146.4
    Elasticsearch                215.8
    FTP                          279.4
    HBase                        1605.6
    hbase20xsql                  465.3
    HDFS                         2202.9
    Hologres                     741.0
    HybridDB for MySQL           111.3
    HybridDB for PostgreSQL      496.9
    Kafka                        3117.2
    LogHub                       1014.1
    MongoDB                      361.3
    MySQL                        459.5
    MaxCompute                   207.2
    Oracle                       133.5
    OSS                          665.3
    Tablestore                   229.3
    OTSStream                    661.7
    PolarDB                      238.2
    PostgreSQL                   165.6
    RDBMS                        845.6
    SQL Server                   143.7
    Stream                       85.0
    Vertica                      454.3
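
The per-thread averages can be used for a rough estimate of how long a sync node might take. The following sketch rests on two assumptions that this topic does not state: throughput scales linearly with the actual concurrency, and the slower of the reader and the writer is the bottleneck. Treat the result as an order-of-magnitude guide only. The function names are hypothetical, and only a few table rows are copied in.

```python
# Per-thread averages (KB per second) copied from the tables above.
READ_KBPS = {"MySQL": 459.5, "MaxCompute": 207.2, "OSS": 665.3}
WRITE_KBPS = {"MySQL": 54.9, "MaxCompute": 660.6, "OSS": 3718.4}


def estimated_kbps(reader: str, writer: str, concurrency: int) -> float:
    """Estimate throughput, assuming linear scaling and a slowest-side bottleneck."""
    if concurrency < 1:
        raise ValueError("concurrency must be a positive integer")
    per_thread = min(READ_KBPS[reader], WRITE_KBPS[writer])
    return per_thread * concurrency


def estimated_seconds(size_kb: float, reader: str, writer: str, concurrency: int) -> float:
    """Estimate the time to transfer size_kb kilobytes under the same assumptions."""
    return size_kb / estimated_kbps(reader, writer, concurrency)


# Example: read from MySQL and write to MaxCompute with an actual concurrency of 4.
rate = estimated_kbps("MySQL", "MaxCompute", 4)
print(f"{rate:.1f} KB per second")
print(f"{estimated_seconds(1024 * 1024, 'MySQL', 'MaxCompute', 4):.0f} seconds for 1 GiB")
```

Use the actual concurrency reported in the runtime log rather than the configured value, because the two can differ.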