What types of data sources support real-time synchronization?

For more information about the types of data sources that support real-time synchronization, see Data source types that support real-time synchronization.

Why is the Internet not recommended for real-time synchronization?

Real-time synchronization over the Internet has the following disadvantages:
  • Packet loss may occur over an unstable network connection, which degrades the performance of data synchronization.
  • The security of data that is synchronized over the Internet is low.

What operation does DataWorks perform on the data records that are synchronized in real time?

When Data Integration synchronizes data from a data source, such as a MySQL, Oracle, LogHub, or PolarDB data source, to a DataHub or Kafka data source in real time, Data Integration adds five fields to the data records in the destination. These fields are used for operations such as metadata management, sorting, and deduplication. For more information, see Fields used for real-time synchronization.

Why does my real-time synchronization node have high latency?

If some of the data processed by your real-time synchronization node cannot be queried in the destination, the node may have high latency. You can go to the Real Time DI page in Operation Center to check whether the latency value of the node is excessively large.

The following information describes the possible causes of high latency and the corresponding solutions.

Problem: High latency occurs on the source.
Cause: A large number of data changes are made on the source. If the latency spikes, the amount of data in the source increased sharply at a specific point in time.
Solution: If the source contains a large amount of data and the high latency is caused by frequent data updates in the source, use one of the following solutions:
  • Modify the configuration of the real-time synchronization node: Adjust the number of parallel threads that are used for data synchronization based on the number of databases or tables from which you want to read data and the maximum number of connections allowed for the source.
    Note Make sure that the number of parallel threads after the adjustment does not exceed the maximum number of parallel threads that your resource group supports. The maximum number of nodes that can run in parallel on a resource group and the maximum number of parallel threads that a resource group supports vary based on the specifications of the resource group. For more information, see Overview. If the node synchronizes data from an ApsaraDB RDS database, specify the number of parallel threads based on the maximum number of connections allowed for the database. If the node synchronizes data from LogHub, specify the number of parallel threads based on the number of shards in the related Logstore.
  • Change the specifications of the resource group: If the amount of data in the source increases, or the configuration of the data synchronization solution to which the real-time synchronization node belongs is modified, the resources in the resource group that you use may become insufficient to synchronize the data in the source. In this case, upgrade the specifications of the resource group. Modifying the configuration of a solution can change the number of source databases and tables. For example, a solution that originally synchronized data from a single table in one database can, after reconfiguration, synchronize data from multiple tables in multiple databases. For more information about how to change the specifications of a resource group, see Change the specifications of a resource group.

Problem: High latency occurs on the source.
Cause: The offset from which data starts to be synchronized is much earlier than the current time.
Solution: An extended period of time is required to read the historical data before data can be read in real time. Wait until the historical data is processed.

Problem: High latency occurs on the destination.
Cause: The performance of the destination is poor, or the loads on the destination are high.
Solution: If the loads on the destination are high, contact the related database administrator. This issue cannot be resolved only by adjusting the number of parallel threads.

Problem: High latency occurs on the source.
Cause: Data is synchronized over the Internet, and the unstable network connection delays the data synchronization node.
Solution: If you synchronize data over the Internet, the timeliness of data synchronization cannot be ensured. We recommend that you establish network connections between the resource group that you use and your data sources and synchronize data over an internal network.
  Note Real-time synchronization over the Internet has the following disadvantages: packet loss may occur over an unstable network connection, which degrades synchronization performance, and the security of data synchronization is low.

Problem: High latency occurs on the source or the destination.
Cause: The performance of the source differs greatly from that of the destination, or the loads on the source or destination are high.
Solution: If the loads on the source or destination are high, contact the related database administrators. This issue cannot be resolved only by adjusting the number of parallel threads.
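When you size the number of parallel threads against a MySQL-compatible source such as ApsaraDB RDS, you can check the connection limit and current usage of the source database with the following standard MySQL statements. The appropriate thread count also depends on your resource group specifications, so treat the result only as an upper bound to stay below:

```sql
-- Check how many concurrent connections the source database allows.
-- Keep the total number of parallel synchronization threads well below this value.
SHOW VARIABLES LIKE 'max_connections';

-- Check how many connections are currently in use, to estimate the headroom
-- that remains for synchronization threads.
SHOW STATUS LIKE 'Threads_connected';
```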

When I run a node to synchronize data from Kafka in real time, the following error message is returned: Startup mode for the consumer set to timestampOffset, but no begin timestamp was specified. What do I do?

Specify an offset from which you want to synchronize data. For more information, see Specify an offset.

When I run a node to synchronize data from MySQL in real time, the following error message is returned: Cannot replicate because the master purged required binary logs. What do I do?

Data Integration cannot find the binary logs generated for the offset from which you want to synchronize data. You must check the retention duration of the binary logs of your MySQL data source and specify an offset within the retention duration when you start your synchronization node.
Note If Data Integration cannot find the binary logs, you can reset the offset to the current time.
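To check the binary log retention of a self-managed MySQL source before you choose an offset, you can run the following standard MySQL statements. For an ApsaraDB RDS database, the retention period is configured in the console instead:

```sql
-- MySQL 8.0: retention period of binary logs, in seconds.
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';

-- MySQL 5.7 and earlier: retention period of binary logs, in days.
SHOW VARIABLES LIKE 'expire_logs_days';

-- List the binary log files that still exist on the server. The offset that
-- you specify must fall within the range covered by these files.
SHOW BINARY LOGS;
```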

When I run a node to synchronize data from MySQL, the following error message is returned: MySQLBinlogReaderException. What do I do?

The binary logging feature is disabled for the secondary MySQL database. If you want to synchronize data from the secondary MySQL database, you must enable this feature for the secondary database. To enable the feature, contact the administrator of the database.

For more information, see Enable the binary logging feature for the MySQL database.
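For a self-managed MySQL database, the binary logging feature is typically enabled in the my.cnf configuration file, as in the following sketch. The server-id value and log file base name are examples, and the server must be restarted for the change to take effect; ApsaraDB RDS instances are configured in the console instead:

```
[mysqld]
server-id     = 1           # any ID that is unique in the replication topology
log-bin       = mysql-bin   # base name of the binary log files
binlog-format = ROW         # row-based logging is generally required for change data capture
```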

When I run a node to synchronize data from MySQL, the following error message is returned: 'show master status' has an error!. What do I do?

If the detailed information about the error is Caused by: java.io.IOException: message=Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation, with command: show master status, the account that you specified when you added the MySQL data source to DataWorks is not granted the required permissions on the MySQL database.

The account must be granted the SELECT, REPLICATION SLAVE, and REPLICATION CLIENT permissions on the MySQL database. For more information about how to grant an account the required permissions on a database, see Create an account and grant the required permissions to the account.
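A minimal GRANT statement that covers these permissions might look like the following sketch. The account name 'sync_user'@'%' is a placeholder; note that REPLICATION SLAVE and REPLICATION CLIENT are global privileges in MySQL and must be granted ON *.*:

```sql
-- Grant the privileges that are required to read binary logs.
-- REPLICATION SLAVE and REPLICATION CLIENT can be granted only globally (ON *.*).
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'sync_user'@'%';
FLUSH PRIVILEGES;

-- Verify the result.
SHOW GRANTS FOR 'sync_user'@'%';
```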

When I run a node to synchronize data from MySQL in real time, the following error message is returned: parse.exception.PositionNotFoundException: can't find start position for xxx. What do I do?

Data Integration cannot find the binary logs generated for the offset from which you want to synchronize data. You must reset an offset for the node.

When I run a node to synchronize data from MySQL in real time, data can be read at the beginning but cannot be read after a period of time. What do I do?

  1. Run the following command on the related MySQL database to view the binary log files that record the data write operation in the database:
    show master status;
  2. Search for journalName=mysql-bin.xx,position=xx in the binary log files of the MySQL database to check whether the binary log files contain data records about the offset specified by the position parameter. For example, you can search for journalName=mysql-bin.000001,position=50.
  3. Contact the database administrator if data is being written to the MySQL database but no data write operations are recorded in binary logs.
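The checks in the steps above can be performed with standard MySQL statements. The file name mysql-bin.000001 and the position 50 below reuse the example values from step 2:

```sql
-- Step 1: view the current binary log file and write position.
SHOW MASTER STATUS;

-- List the binary log files that exist on the server.
SHOW BINARY LOGS;

-- Step 2: check whether events exist at and after the specified offset.
SHOW BINLOG EVENTS IN 'mysql-bin.000001' FROM 50 LIMIT 10;
```

If data is being written to the database but the last statement returns no new events, binary logging may not record the writes, which is the situation described in step 3.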

How do I deal with the TRUNCATE statement during real-time data synchronization?

Real-time synchronization supports the TRUNCATE statement. The TRUNCATE statement takes effect when full and incremental data is merged. If you do not execute the TRUNCATE statement, excessive data may be generated during data synchronization.

How do I improve the speed and performance of real-time synchronization?

If data is written to the destination at a low speed, you can increase the number of parallel threads for the destination and modify the values of the Java Virtual Machine (JVM) parameters. The values of the JVM parameters affect only the frequency of full heap garbage collection (Full GC). A large JVM heap memory reduces the frequency of Full GC and improves the performance of real-time synchronization.
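The exact JVM parameters are configured on the synchronization node; as an illustration, the heap-related options that are usually tuned take the following form. The 4g values are placeholders and must fit within the memory of your resource group:

```
# Set the initial and maximum JVM heap size to the same value so that the heap
# does not resize at run time; a larger heap reduces the frequency of Full GC.
-Xms4g -Xmx4g
```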

When I run a node to synchronize data from Hologres in real time, the following error message is returned: permission denied for database xxx. What do I do?

Before you run a node to synchronize data from Hologres in real time, you must obtain the permissions of the <db>_admin user group in the Hologres console for your account. For more information, see Overview.

Can I directly run a real-time synchronization node on the codeless user interface (UI)?

You cannot directly run a real-time synchronization node on the codeless UI. You must commit and deploy the real-time synchronization node and run the node in the production environment. For more information, see O&M of real-time synchronization nodes.

When I run a real-time synchronization node to synchronize data to MaxCompute in real time, the following error message is reported: ODPS-0410051:invalid credentials-accessKeyid not found. What do I do?

If you use the default MaxCompute data source odps_first as the destination of the real-time synchronization node, a temporary AccessKey pair is used for data synchronization by default. The temporary AccessKey pair is valid for only seven days. After the temporary AccessKey pair expires, the real-time synchronization node fails. If the system detects that the temporary AccessKey pair has expired, the system restarts the real-time synchronization node. If a related alert rule is configured for the node, the system reports an error.

Why are errors repeatedly reported when a real-time synchronization node is run to synchronize data to Oracle, PolarDB, or MySQL?

  • Problem description: When a real-time synchronization node is run to synchronize data to Oracle, PolarDB, or MySQL, errors are repeatedly reported.
    By default, data changes generated by DDL operations on the source cannot be synchronized to Oracle, PolarDB, or MySQL by using a real-time synchronization node. If data changes are generated by DDL operations other than CREATE TABLE in the source, the system reports an error for the real-time synchronization node and the node fails. In a resumable upload scenario, the following situation may exist: No DDL operations are performed on the source, but the system still reports an error for the real-time synchronization node.
    Note To prevent data loss or disorder within a specific period of time, we recommend that you do not use the rename command to swap the names of two columns. For example, if you swap the name of Column A with that of Column B, data loss or disorder may occur.
  • Cause: Real-time synchronization supports resumable uploads. To ensure data integrity, after the real-time synchronization node is started, the node may read the data changes that are generated by previous DDL operations again. As a result, the error is reported again.
  • Solution:
    1. If data changes are generated by DDL operations in the source, manually make the same changes in the destination.
    2. Start the real-time synchronization node and change the processing rule for DDL messages from error reporting to ignoring.
      Note In a resumable upload scenario, the real-time synchronization node also subscribes to the DDL events. To ensure that the node can run as expected, you must temporarily change the processing rule for DDL messages from error reporting to ignoring.
    3. Stop the real-time synchronization node, change the processing rule for DDL messages from ignoring back to error reporting, and then restart the real-time synchronization node.