DataWorks: Offline synchronization FAQ

Last Updated: Nov 14, 2025

This topic describes common issues related to offline synchronization.

Document overview

You can search for keywords in this document to find common issues and their solutions.

Problem categorization

Issues are grouped by problem category; each entry lists a problem keyword and, where available, the corresponding reference question.

  • Common O&M issues for offline sync tasks

    • Network communication issues: Why does the data source connectivity test succeed, but the offline sync task fails to connect to the data source?

    • Switch resource groups: How do I switch the execution resource group for an offline sync task?

    • Dirty data

    • Running timeout:

      • How do I troubleshoot long-running offline sync tasks?

      • A data sync task slows down because the WHERE clause lacks an index, leading to a full table scan

    • Retain source table default values: If a source table has default values, will they be retained in a target table created by Data Integration, along with NOT NULL constraints?

    • Shard key: Can a composite primary key be used as a shard key when configuring an offline integration task?

    • Missing data: After data synchronization, the data in the target table is inconsistent with the source table

  • Causes and solutions for non-plugin errors

    • Dirty data: How do I handle dirty data errors caused by encoding format settings or garbled text?

    • SSRF attack: How do I handle the "Task have SSRF attacks" error?

    • Network communication issues: Why does an offline sync task sometimes succeed and sometimes fail?

    • Table/column name keywords: How do I handle sync task failures caused by keywords in table or column names?

    • Adding columns to a table: How do I handle adding or modifying columns in the source table of an offline sync task?

    • Date writing: When writing date and time data to a text file, how do I retain milliseconds or specify a custom date and time format?

  • Causes and solutions for specific plugin errors

    • MongoDB

    • OSS: Is there a limit on the number of files that can be read from OSS?

    • DataHub: When writing to DataHub, how do I handle a write failure caused by exceeding the single-write data limit?

    • Lindorm: When using the Lindorm bulk method to write data, is historical data replaced every time?

    • Elasticsearch: How do I query all fields under an ES index?

    • OTS Writer configuration: How do I configure OTS Writer to write data to a target table that contains an auto-increment primary key column?

    • Time series model configuration: In a time series model configuration, what do the _tag and is_timeseries_tag fields mean?

  • Offline synchronization scenarios and solutions

    • Custom table names: How do I customize table names for offline sync tasks?

    • MaxCompute

    • Task configuration issues: When configuring an offline sync node, I cannot view all tables. How do I handle this?

    • LogHub

    • Kafka

    • OSS

    • MySQL

    • Modify TTL: For a synchronized data table, can the TTL only be modified using the ALTER method?

    • Function aggregation: When synchronizing via API, is it supported to use source-side functions (such as MaxCompute functions) for aggregation? For example, the source table has columns a and b as the primary key for Lindorm.

    • Elasticsearch

    • Field mapping: For an unstructured data source, fields cannot be mapped when I click Data Preview. How do I handle this?

  • Error messages and solutions

    • Resource setting issues

    • OSS: An error occurs when reading data from OSS: AccessDenied The bucket you access does not belong to you.

    • Redis: When writing to Redis in hash mode, an error occurs: Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!

    • PostgreSQL: An error occurs when reading data from PostgreSQL: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

    • MySQL

    • Task instance conflict: An offline task fails with the error: Duplicate entry 'xxx' for key 'uk_uk_op'

    • Network communication issues: An offline sync task with a MySQL data source fails with a connection timeout error: Communications link failure

    • Field mapping: An offline task fails with the error: plugin xx does not specify column

    • MaxCompute

    • RestAPI: RestAPI Writer reports an error: The JSON string found through path:[] is not an array type

    • RDS: An offline sync task with an Amazon RDS source fails with the error: Host is blocked

    • MongoDB

    • Elasticsearch

    • Hive: An offline sync task to a local Hive instance fails with the error: Could not get block locations.

    • Running timeout: An offline sync task with a MongoDB source fails with the error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

Network communication issues

Why does the data source connectivity test succeed, but the offline sync task fails to connect to the data source?

  • If the connectivity test passed previously, run it again to confirm that the resource group and the database are still connected. Ensure that no changes have been made to the database.

  • Check whether the resource group used for the connectivity test is the same as the one used for task execution.

    To view the resource group on which the task is running:

    • If the task runs on the default resource group, the log contains the following information: running in Pipeline[basecommon_ group_xxxxxxxxx]

    • If the task runs on an exclusive resource group for Data Integration, the log contains the following information: running in Pipeline[basecommon_S_res_group_xxx]

    • If the task runs on a Serverless resource group, the log contains the following information: running in Pipeline[basecommon_Serverless_res_group_xxx]

  • If a scheduled task fails intermittently in the early morning but succeeds on a rerun, check the database load at the time of the error.

Why does an offline sync task sometimes succeed and sometimes fail?

An offline sync task may fail intermittently if the whitelist configuration is incomplete. Check whether the database whitelist is fully configured.

When you use an exclusive resource group for Data Integration:

  • If you added the elastic network interface (ENI) IP addresses of the exclusive resource group for Data Integration to the data source's whitelist, you must update the whitelist after you scale out the resource group. Add the ENI IP addresses of the scaled-out resource group to the data source's whitelist.

  • To avoid updating the whitelist after scaling out the resource group, you can add the vSwitch CIDR block of the exclusive resource group for Data Integration to the database whitelist. For more information, see Add a whitelist.

When you use a Serverless resource group, see Add a whitelist to check the whitelist configuration of the resource group and ensure that the network settings are correct.

If the whitelist is configured correctly, check whether the database load is too high. A high load can cause the connection to be interrupted.

Resource setting issues

An offline sync task fails with the error: [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots.

  • Possible cause:

    The concurrency level is set too high, which results in insufficient resources.

  • Solution:

    Reduce the concurrency of the offline sync task.

    • When configuring an offline sync task in the codeless UI, reduce the value of Maximum Concurrency in the channel control settings. For more information, see Configure a sync node in the codeless UI.

    • When configuring an offline sync task in the code editor, reduce the value of the concurrent parameter in the channel control settings. For more information, see Configure a sync node in the code editor.
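
For reference, the following is a minimal sketch (not a complete job configuration) of how the channel control settings might look in the code editor after reducing concurrency. Only the concurrent value is the point here; the reader, writer, and other settings are omitted and must stay as already configured.

#Reduce the maximum concurrency in the channel control settings
"setting": {
    "speed": {
        "concurrent": 2
    }
}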

An offline sync task fails with the error: OutOfMemoryError: Java heap space

To resolve this error, perform the following steps:

  1. If the plugin configuration supports parameters such as `batchSize` or `maxFileSize`, reduce their values (see the sketch after this list).

    To check which plugins support these parameters, go to Supported data sources and reader and writer plugins and click the relevant plugin to view its parameter details.

  2. Reduce the concurrency.

    • When you configure an offline sync task in the codeless UI, reduce the value of Maximum Concurrency in the channel control settings. For more information, see Configure a sync node in the codeless UI.

    • If you configure an offline sync task in the code editor, reduce the value of the concurrent parameter in the channel control settings. For more information, see Configure a sync node in the code editor.

  3. For file synchronization, such as synchronizing files from OSS, reduce the number of files to be read.

  4. In the Running Resources section of the task configuration, increase the Resource Consumption (CU) value to an appropriate level to avoid affecting other running tasks.
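
As a rough illustration of steps 1 and 2, assuming the task is configured in the code editor and the reader plugin supports `batchSize` (check the plugin reference for the exact parameter name), the relevant fragments might be adjusted as follows; the data source and other required parameters are omitted, and the values shown are only examples.

#Step 1: reduce the batch size in the reader's parameter section
"parameter": {
    "batchSize": 256
}
#Step 2: reduce the maximum concurrency in the channel control settings
"setting": {
    "speed": {
        "concurrent": 1
    }
}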

Task instance conflict

An offline task fails with the error: Duplicate entry 'xxx' for key 'uk_uk_op'

  • Error message: Error updating database. Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'cfc68cd0048101467588e97e83ffd7a8-0' for key 'uk_uk_op'.

  • Possible cause: Data Integration does not allow different instances of the same sync task node to run concurrently. This means multiple sync tasks that have the same JSON configuration cannot run at the same time. For example, a sync task is scheduled to run every 5 minutes and experiences an upstream delay. At 00:05, both the instance for 00:00 and the instance for 00:05 are triggered. This can cause one instance to fail. Conflicts can also occur if a data backfill or rerun operation is performed while a task instance is running.

  • Solution: Stagger the running times of the instances. For tasks that run every hour or minute, you can set a self-dependency. This ensures that the instance for the current cycle starts only after the instance for the previous cycle is complete. For the previous version of Data Development, see Self-dependency. For the new version of Data Development, see Select a dependency type (cross-cycle dependency).

Running timeout

An offline sync task with a MongoDB source fails with the error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Error message: A data sync task fails with the error MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Possible cause: The amount of data being pulled is too large.

  • Solution:

    • Increase the concurrency.

    • Decrease the BatchSize.

    • In the Reader parameter settings, add the cursorTimeoutInMs configuration. Try setting a larger value, such as 3,600,000 ms.
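
A minimal sketch of the MongoDB Reader parameter section with this timeout raised; other required reader parameters, such as the data source and collection, are omitted here.

#Raise the cursor timeout to 1 hour (3,600,000 ms)
"parameter": {
    "cursorTimeoutInMs": 3600000
}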

An offline sync task with a MySQL data source fails with a connection timeout error: Communications link failure

  • Read error

    • Symptom:

      When data is read, the following error occurs: Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago. The last packet sent successfully to the server was 7,200,100 milliseconds ago. - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

    • Possible cause:

      The database is executing the SQL query slowly, which causes a MySQL read timeout.

    • Solution:

      • Check whether a where filter condition is set and ensure that an index has been added to the filtered field.

      • Check whether the source data table contains too much data. If it does, split the task into multiple smaller tasks.

      • Find the blocking SQL statement in the logs and consult the database administrator (DBA) to resolve the issue.

  • Write error

    • Symptom:

      When data is written, the following error occurs: Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago. The last packet sent successfully to the server was 12,013 milliseconds ago. More...

    • Possible cause:

      A slow query is causing a socket timeout. The default socket timeout for TDDL connections is 12 seconds. If an SQL statement takes more than 12 seconds to execute on the MySQL side without returning a result, a 4614 error is reported. This error may occur intermittently when the data volume is large or the server is busy.

    • Solution:

      • Rerun the sync task after the database stabilizes.

      • Contact the database administrator to adjust the timeout period.

How do I troubleshoot long-running offline sync tasks?

Possible cause 1: Long execution time

  • Pre-execution statements or post-execution statements, such as `preSql` and `postSql`, take too long to execute in the database, which causes the task to run slowly.

  • The shard key is not configured properly, which causes the task to run slowly.

    Offline synchronization uses a shard key (`splitPk`) to shard data. The data synchronization process starts concurrent tasks based on this configuration to improve efficiency. Check the relevant plugin documentation to determine whether a shard key needs to be configured.

Solution 1:

  • If you configure pre-execution or post-execution statements, use indexed fields for data filtering.

  • If the plugin supports a shard key, configure it properly. The following information describes how to configure the shard key for the MySQL Reader plugin (a configuration sketch follows this list):

    • Use the table's primary key for `splitPk`. The primary key is usually distributed evenly, which helps prevent data hot spots in the created shards.

    • Currently, `splitPk` supports only integer data for sharding. It does not support other data types, such as strings, floating-point numbers, or dates. If you specify an unsupported data type, a single channel is used for synchronization.

    • If you do not specify a `splitPk` or if the `splitPk` value is empty, data synchronization uses a single channel to sync the table data.
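
The following is a sketch of a MySQL Reader parameter section with `splitPk` set. The data source, table, and column names are placeholders for illustration only; adjust them to your actual configuration.

#splitPk should be the table's integer primary key so that shards are evenly distributed
"parameter": {
    "datasource": "my_mysql_source",
    "table": "orders",
    "column": ["id", "gmt_create", "status"],
    "splitPk": "id",
    "where": ""
}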

Possible cause 2: The task is waiting for Data Integration execution resources

Solution 2: If the log shows a prolonged `WAIT` status, the exclusive resource group for Data Integration that is used by the task does not have enough available concurrent resources to run the task. For more information about the cause and solution, see Why does a Data Integration task always show the "wait" status?.

Note

An offline sync task is dispatched from a scheduling resource group to a Data Integration execution resource group. Therefore, an offline sync task consumes one scheduling resource. If an offline sync task runs for a long time without releasing its resources, it blocks not only other offline tasks but also other types of scheduled tasks.

A data sync task slows down because the WHERE clause lacks an index, leading to a full table scan

  • Scenario example

    The executed SQL statement is as follows.

    SELECT bid,inviter,uid,createTime FROM `relatives` WHERE createTime>='2016-10-23 00:00:00' AND createTime<'2016-10-24 00:00:00';

    Execution starts at 2016-10-25 11:01:24.875 and results start returning at 2016-10-25 11:11:05.489. The sync program waits for the database to return the SQL query results. As a result, MaxCompute must wait a long time before it can execute.

  • Cause analysis

    When querying with the `WHERE` clause, the createTime column has no index, which leads to a full table scan.

  • Solution

    Use indexed columns in the where clause to improve performance. You can also add indexes to the required columns.
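
For instance, an index on the createTime column used in the example above could be added with standard MySQL DDL; the index name here is arbitrary.

ALTER TABLE `relatives` ADD INDEX idx_createtime (createTime);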

Switch resource groups

How do I switch the execution resource group for an offline sync task?

For the previous version of Data Development:

You can modify the resource group for debugging on the offline sync task details page in DataStudio. You can also modify the Data Integration execution resource group for task scheduling in Operation Center. For more information, see Switch a Data Integration resource group.

For the new version of Data Development:

You can modify the resource group used for debugging Data Integration tasks in DataStudio. You can also modify the execution resource group for scheduled Data Integration tasks in Operation Center. For more information, see Resource group O&M.

Dirty data

How do I troubleshoot and locate dirty data?

Definition of dirty data: A single data record that causes an exception when it is written to the target data source is considered dirty data. Any data that fails to be written is classified as dirty data.

Impact of dirty data: Dirty data is not written to the destination. You can control whether to allow dirty data and set a threshold for the number of dirty data records. By default, Data Integration allows dirty data. You can specify the number of allowed dirty data records when you configure the sync task. For more information, see Configure a sync node in the codeless UI.

  • If a task is set to allow dirty data: When dirty data is generated, the task continues to run, but the dirty data is discarded and not written to the destination.

  • Allowed number of dirty data records:

    • If the number of allowed dirty data records is 0, the task fails and stops when the first dirty data record is generated.

    • If the number of allowed dirty data records is set to x, the task fails and stops if the number of dirty data records exceeds x. If the number of dirty data records is less than x, the task continues to run, but the dirty data is discarded and not written to the destination.
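
In the code editor (script mode), the dirty data threshold corresponds to the errorLimit setting. The following is a minimal sketch, where 10 is only an example value.

#Allow at most 10 dirty data records before the task fails
"setting": {
    "errorLimit": {
        "record": 10
    }
}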

Dirty data scenarios:

  • Scenario 1:

    • Error message: {"message":"Dirty data was encountered when writing to the ODPS destination table: An error occurred in the data of the [3rd] field. Please check the data and modify it, or you can increase the threshold to ignore this record.","record":[{"byteSize":0,"index":0,"type":"DATE"},{"byteSize":0,"index":1,"type":"DATE"},{"byteSize":1,"index":2,"rawData":0,"type":"LONG"},{"byteSize":0,"index":3,"type":"STRING"},{"byteSize":1,"index":4,"rawData":0,"type":"LONG"},{"byteSize":0,"index":5,"type":"STRING"},{"byteSize":0,"index":6,"type":"STRING"}]}.

    • Solution: The log indicates that the dirty data is in the third field.

      • The writer reports the dirty data. Check the table creation statement for the writer. This error can occur if the field size specified for the table in ODPS is smaller than the data size of the corresponding field in MySQL.

      • Data synchronization principle: Data from the source data source must be writable to the destination data source. This means the source and destination data types must match, and the defined field sizes must be compatible. For example, data of the `VARCHAR` type from the source cannot be written to a target column of the `INT` type. The data type size defined at the destination must be able to accommodate the actual data size of the mapped field from the source. For source data of the `LONG`, `VARCHAR`, or `DOUBLE` type, the destination can use large-range types such as `string` or `text`.

      • If the cause of the dirty data error is not clear, copy the entire dirty data record from the log. Then, examine the data and compare it with the destination data types to identify which part does not conform to the specification.

      For example:

      {"byteSize":28,"index":25,"rawData":"ohOM71vdGKqXOqtmtriUs5QqJsf4","type":"STRING"}

      In this example, `byteSize` is the number of bytes, `index: 25` indicates the 26th field, `rawData` is the specific value, and `type` is the data type.

  • Scenario 2:

    • Symptom: DataX reports dirty data when it reads null values from MySQL.

    • Solution: Check whether the field with a null value in the source matches the data type of the mapped field in the target table. An error is reported if the data types are inconsistent. For example, an error occurs if you try to write a null value of the string type to a target field of the int type.

How do I view dirty data?

You can go to the log details page and click Detail log url to view the offline sync log and dirty data information.

If the number of dirty data records exceeds the limit during an offline sync task, will the data that has already been synchronized be retained?

The number of dirty data records is accumulated during task execution. If this number exceeds the specified dirty data limit, the task is immediately aborted.

  • Data retention: Data that was successfully written to the destination before the task was aborted is retained. No rollback operation is performed.

  • Zero-tolerance policy: If the dirty data limit is set to 0, the system adopts a zero-tolerance policy. This means the task fails and stops immediately after it detects the first dirty data record.

How do I handle dirty data errors caused by encoding format settings or garbled text?

  • Symptom:

    If the data includes emojis, a dirty data error may be reported during synchronization: [13350975-0-0-writer] ERROR StdoutPluginCollector - Dirty data {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"}],"type":"writer"} .

  • Possible cause:

    • The database encoding is not set to `utf8mb4`, which causes an error when synchronizing emojis.

    • The source data itself is garbled.

    • The database and the client use different character sets.

    • The browser uses a different character set, which causes a preview failure or garbled text.

  • Solution:

    Choose the appropriate solution based on the cause of the garbled text:

    • If your raw data is garbled, process the raw data before you run the sync task.

    • If the database and client use inconsistent character sets, unify the character sets first.

    • If the browser uses a character set that is inconsistent with the database or client character set, unify the character sets first, and then preview the data.

    You can try the following operations:

    1. For a data source added using the Java Database Connectivity (JDBC) format, modify the connection string to use `utf8mb4`: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    2. For a data source added using an instance ID, append the parameter after the database name: database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    3. Modify the relevant database encoding format to `utf8mb4`. For example, you can modify the RDS database encoding format in the RDS console.

      Note

      Command to set the RDS data source encoding format: set names utf8mb4. Command to view the RDS database encoding format: show variables like 'char%'.

Retain source table default values

If a source table has default values, will they be retained in a target table created by Data Integration, along with NOT NULL constraints?

When DataWorks creates a target table, it retains only the column names, data types, and comments from the source table. It does not retain the source table's default values or constraints, such as `NOT NULL` constraints and indexes.

Shard key

Can a composite primary key be used as a shard key when configuring an offline integration task?

No. Offline integration tasks do not support using a composite primary key as a shard key.

Missing data

After data synchronization, the data in the target table is inconsistent with the source table

If you encounter data quality issues after data synchronization is complete, see Troubleshoot data quality issues in offline synchronization for detailed troubleshooting.

SSRF attack

How do I handle the "Task have SSRF attacks" error?

Cause: To ensure cloud security, DataWorks prohibits tasks from directly accessing internal network addresses in the cloud environment using public IP addresses. This security measure is triggered when a URL that points to an internal IP address or a VPC domain name is entered in a plugin configuration, such as HTTP Reader.

Solution:

For tasks that run on internal data sources, stop using public resource groups. Instead, switch to a secure and reliable Serverless resource group (recommended) or an exclusive resource group for Data Integration.

Date writing

When writing date and time data to a text file, how do I retain milliseconds or specify a custom date and time format?

After you switch the sync task to the code editor, add the following configuration item in the setting section of the task configuration page:

"common": {
  "column": {
    "dateFormat": "yyyyMMdd",
    "datetimeFormatInNanos": "yyyyMMdd HH:mm:ss.SSS"
  }
}

Where:

  • `dateFormat` specifies the date format to use when a source `DATE` type (without time) is converted to text.

  • `datetimeFormatInNanos` specifies the date format to use when a source `DATETIME` or `TIMESTAMP` type (with time) is converted to text, with precision up to the millisecond.

MaxCompute

Notes on "Add Row" or "Add Field" for source table fields when reading data from a MaxCompute (ODPS) table

  1. You can enter constants. The values must be enclosed in single quotation marks, such as 'abc' or '123'.

  2. You can use them with scheduling parameters, such as '${bizdate}'. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

  3. You can enter the names of the partition columns that you want to synchronize, such as `pt`.

  4. If the value that you enter cannot be parsed, the type is displayed as 'Custom'.

  5. You cannot configure ODPS functions.

  6. If a manually added column is displayed as custom, such as a MaxCompute partition column or a column that is not previewed in LogHub data, the task execution is not affected.

When reading data from a MaxCompute (ODPS) table, how do I synchronize partition fields?

In the field mapping list, under Source table fields, click Add Row or Add Field. Enter the partition column name, such as pt, and configure its mapping to the target table field.

When reading data from a MaxCompute (ODPS) table, how do I synchronize data from multiple partitions?

Specify the partitions from which you want to read data.

  • The ODPS partition configuration supports Linux shell wildcard characters. An asterisk (*) represents zero or more characters, and a question mark (?) represents a single character.

  • By default, a task reports an error when it attempts to read a non-existent partition. To allow the task to succeed even if a partition does not exist, you can set the When Partition Does Not Exist parameter to Ignore non-existent partitions. Alternatively, you can switch to the code editor and add the "successOnNoPartition": true configuration to the ODPS parameters (see the sketch after the examples below).

For example, if the partitioned table `test` contains four partitions: `pt=1,ds=hangzhou`, `pt=1,ds=shanghai`, `pt=2,ds=hangzhou`, and `pt=2,ds=beijing`, the configurations for reading data from different partitions are as follows:

  • To read data from the `pt=1,ds=hangzhou` partition, the partition information configuration is "partition":"pt=1,ds=hangzhou".

  • To read data from all partitions where `pt=1`, the partition information configuration is "partition":"pt=1,ds=*".

  • To read data from all partitions in the `test` table, the partition information configuration is "partition":"pt=*,ds=*".
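
Combining the options above, a sketch of the ODPS Reader parameter section in the code editor might look like the following. The data source and column names are placeholders, and successOnNoPartition is only needed if non-existent partitions should be ignored.

#Read all pt=1 partitions of table test and ignore non-existent partitions
"parameter": {
    "datasource": "my_odps_source",
    "table": "test",
    "partition": "pt=1,ds=*",
    "column": ["col1", "col2"],
    "successOnNoPartition": true
}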

You can also set conditions to retrieve partition data. The following operations require you to configure the task in the code editor:

  • To specify the maximum partition, you can add the /*query*/ ds=(select MAX(ds) from DataXODPSReaderPPR) configuration.

  • To filter by a condition, add /*query*/ followed by a condition expression on the partition column. For example, /*query*/ pt>=20170101 and pt<20170110 retrieves all data from partitions where `pt` is on or after January 1, 2017 and before January 10, 2017.

Note

/*query*/ indicates that the content that follows is recognized as a `WHERE` condition.

How do I perform column filtering, reordering, and null padding in MaxCompute?

By configuring MaxCompute Writer, you can perform operations that MaxCompute does not support, such as column filtering, reordering, and null padding. For example, if the list of fields to be imported includes all fields, you can configure it as "column": ["*"].

If a MaxCompute table has three fields, `a`, `b`, and `c`, and you want to synchronize only fields `c` and `b`, you can configure the columns as "column": ["c","b"]. This means the first and second columns from the reader are imported into the `c` and `b` fields of the MaxCompute table, and the `a` field in the newly inserted row of the MaxCompute table is set to null.

Handling MaxCompute column configuration errors

To ensure the reliability of written data and avoid data quality issues caused by the loss of extra column data, MaxCompute Writer reports an error if it detects extra columns. For example, if a MaxCompute table has fields `a`, `b`, and `c`, and MaxCompute Writer attempts to write more than three columns, an error is reported.

Notes on MaxCompute partition configuration

MaxCompute Writer supports writing data only to the last-level partition. It does not support features such as partition routing based on a specific field. For example, if a table has three levels of partitions, the partition configuration must explicitly specify writing to a third-level partition. For example, to write data to the third-level partition of a table, you can configure it as pt=20150101, type=1, biz=2, but not as pt=20150101, type=1 or pt=20150101.

MaxCompute task reruns and failover

MaxCompute Writer ensures write idempotence when you configure "truncate": true. This means that if a write operation fails and the task is run again, MaxCompute Writer clears the previous data and imports the new data to ensure data consistency after each rerun. If the task is interrupted by other exceptions during execution, data atomicity cannot be guaranteed. The data is not rolled back or automatically rerun. You must use the idempotence feature and rerun the task to ensure data integrity.

Note

If `truncate` is set to `true`, all data in the specified partition or table is cleared. Use this parameter with caution.
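
For reference, a sketch of an ODPS Writer parameter section with truncate enabled. The data source and table names are placeholders, the ${bizdate} scheduling parameter follows the usage shown earlier in this topic, and other required writer settings such as column are omitted.

#truncate:true clears the target partition before writing, which makes reruns idempotent
"parameter": {
    "datasource": "my_odps_source",
    "table": "my_table",
    "partition": "pt=${bizdate}",
    "truncate": true
}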

An error occurs when reading data from a MaxCompute (ODPS) table: The download session is expired.

  • Error message:

    Code:DATAX_R_ODPS_005:Failed to read ODPS data, Solution:[Please contact the ODPS administrator]. RequestId=202012091137444331f60b08cda1d9, ErrorCode=StatusConflict, ErrorMessage=The download session is expired.

  • Possible cause:

    Offline synchronization uses the MaxCompute Tunnel command to upload and download data. The lifecycle of a Tunnel session on the server is 24 hours. Therefore, if an offline sync task runs for more than 24 hours, the task fails and stops. For more information about Tunnel, see Instructions.

  • Solution:

    Increase the concurrency of the offline sync task and plan the amount of data to be synchronized to ensure that the task can be completed within 24 hours.

An error occurs when writing to MaxCompute (ODPS): Error writing request body to server

  • Error message:

    Code:[OdpsWriter-09], Description:[Failed to write data to the ODPS destination table.]. - ODPS destination table write block:0 failed, uploadId=[202012081517026537dc0b0160354b]. Please contact the ODPS administrator. - java.io.IOException: Error writing request body to server.

  • Possible cause:

    • Possible cause 1: Data type exception. The source data does not conform to the ODPS data type specification. For example, you are trying to write the value `4.2223` to a field of the ODPS `decimal(18,10)` data type.

    • Possible cause 2: ODPS block or communication exception.

  • Solution:

    Convert the data to a data type that conforms to the data type specification.

MySQL

How do I synchronize sharded MySQL tables into a single MaxCompute table?

You can configure this by following the instructions in Synchronize sharded tables.

When the destination MySQL table's character set is utf8mb4, how do I handle garbled Chinese characters after synchronization?

When you add a data source using a connection string, modify the JDBC format to: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45. For more information, see Configure a MySQL data source.

An error occurs when writing to or reading from MySQL: Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout' on the server.

  • Cause:

    • `net_read_timeout`: DataX splits the data from RDS for MySQL into several equal data retrieval SQL statements (`SELECT` statements) based on `SplitPk`. When the execution time of an SQL statement exceeds the maximum allowed running time on the RDS side, this error occurs.

    • `net_write_timeout`: The timeout period for waiting to send a block to the client is too short.

  • Solution:

    Add the `net_write_timeout` or `net_read_timeout` parameter to the data source URL connection and set a larger value, or adjust this parameter in the RDS console.

  • Suggestion:

    If the task can be rerun, set the task to automatically rerun on error.

For example: jdbc:mysql://192.168.1.1:3306/lizi?useUnicode=true&characterEncoding=UTF8&net_write_timeout=72000

An offline sync task to MySQL fails with an error: [DBUtilErrorCode-05]ErrorMessage: Code:[DBUtilErrorCode-05]Description:[Failed to write data to the configured destination table.]. - com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed

Cause:

The default value for the wait_timeout parameter in MySQL is 8 hours. If data is still being fetched when this time is reached, the sync task is interrupted.

Solution:

Modify the MySQL configuration file my.cnf (my.ini on Windows). In the [mysqld] section, add the following parameters (values in seconds): wait_timeout=2592000 and interactive_timeout=2592000. Then restart MySQL, log on, and run the following statement to check whether the setting took effect: show variables like '%wait_time%'.
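
The corresponding my.cnf fragment would look roughly as follows.

[mysqld]
# 2592000 seconds = 30 days
wait_timeout=2592000
interactive_timeout=2592000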

An error occurs when reading from a MySQL database: The last packet successfully received from the server was 902,138 milliseconds ago

Normal CPU usage but high memory usage may cause the connection to be disconnected.

If you confirm that the task can be automatically rerun, set the task to Rerun upon Error. For more information, see Time property configuration.

PostgreSQL

An error occurs when reading data from PostgreSQL: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Scenario: When you synchronize PostgreSQL data using an offline sync tool, the following error occurs: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Possible cause: Pulling data from the database takes so long that the query conflicts with recovery on the standby server.

  • Solution: Increase the values of the max_standby_archive_delay and max_standby_streaming_delay parameters. For more information, see Standby Server Events.

RDS

An offline sync task with an Amazon RDS source fails with the error: Host is blocked

If connecting to Amazon RDS returns Host is blocked, you need to disable the Amazon load balancer health check. After you disable it, the block issue is no longer reported.

MongoDB

Why do I get an error when adding a MongoDB data source with the root user?

An error occurs when you add a MongoDB data source with the root user because you must use a username that is created in the database of the table you need to synchronize. You cannot use the root user.

For example, to import the `name` table, and the `name` table is in the `test` database, the database name here must be `test`, and you must use the username of a user created in the `test` database.

When reading from MongoDB, how do I use a timestamp in the query parameter to perform incremental synchronization?

You can use an assignment node to first process the date type time into a timestamp and use this value as an input parameter for MongoDB data synchronization. For more information, see How to perform incremental synchronization on a timestamp type field in MongoDB?

After synchronizing data from MongoDB to a destination data source, the time zone is shifted by +8 hours. How do I fix this?

You need to set the time zone in the MongoDB Reader configuration. For more information, see MongoDB Reader.

During data reading from MongoDB, records are updated in the source but not synchronized to the destination. How do I handle this?

You can restart the task after a period of time with the query condition unchanged. This means delaying the execution time of the sync task without changing the configuration.

Is MongoDB Reader case-sensitive?

When reading data, the Column.name configured by the user is case-sensitive. If it is configured incorrectly, the read data is null. For example:

  • MongoDB source data:

    {
        "MY_NAME": "zhangsan"
    }
  • Sync task's Column configuration:

    {
        "column":
        [
            {
                "name": "my_name"
            }
        ]
    }

Because the case in the sync task's configuration does not match the source data, a data reading exception occurs.

How do I configure the timeout duration for MongoDB Reader?

The timeout configuration parameter is cursorTimeoutInMs, with a default value of 600,000 ms (10 minutes). This parameter represents the total time the MongoDB server takes to execute a query, not including data transmission time. If the amount of data read in a full pull is large, it may cause the error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

An error occurs when reading from MongoDB: no master

Currently, DataWorks sync tasks do not support reading data from a secondary node. If you configure reading from a secondary node, the error no master is reported.

An error occurs when reading from MongoDB: MongoExecutionTimeoutException: operation exceeded time limit

  • Cause analysis:

    This error is caused by a cursor timeout.

  • Solution:

    Increase the value of the cursorTimeoutInMs parameter.

An offline sync task reading from MongoDB fails with the error: DataXException: operation exceeded time limit

You need to increase the task concurrency and the `BatchSize` for reading.

A MongoDB sync task fails with the error: no such cmd splitVector

  • Possible cause:

    When a sync task runs, it uses the splitVector command for task sharding by default. Some MongoDB versions do not support the splitVector command, which leads to the no such cmd splitVector error.

  • Solution:

    1. Go to the sync task configuration page and click the Convert to Script button to switch the task to the code editor.

    2. In the MongoDB parameter configuration, add the parameter

      "useSplitVector" : false

      to avoid using splitVector.

A MongoDB offline sync task fails with the error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2"

  • Symptom:

    This issue may occur if, for example in the codeless UI, Write Mode (Overwrite) is set to Yes and a field other than _id is configured as the Business Primary Key.

  • Possible cause:

    The data contains records whose _id does not match the configured Business Primary Key (for example, a business primary key field named my_id).

  • Solution:

    • Solution 1: Modify the offline sync task to ensure that the configured Business Primary Key is consistent with the _id.

    • Solution 2: Use _id as the business primary key for data synchronization.

Redis

When writing to Redis in hash mode, an error occurs: Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!

  • Cause:

    When Redis uses hash mode for storage, the hash's attribute and value must appear in pairs. For example: odpsReader: "column":[ "id", "name", "age", "address" ]. In the destination, RedisWriter is configured with `"keyIndexes":[ 0, 1]`. In Redis, `id` and `name` are used as the key, `age` as the attribute, and `address` as the value, stored in a hash type. If the ODPS source is configured with only two columns, you cannot use hash mode to store the Redis cache, and this exception is reported.

  • Solution:

    If you want to use only two columns, configure Redis to store the information in String mode. If you must use hash mode, configure at least three columns in the source.

OSS

When reading a CSV file with multiple delimiters, how do I handle dirty data?

  • Symptom:

    When you configure an offline sync task to read data from file storage such as OSS or FTP, if the file is in CSV format and uses multiple characters as a column delimiter (for example, "|,", "##", or ";;"), the task may fail with a "dirty data" error. In the task's run log, you will see an IndexOutOfBoundsException (array index out of bounds) error, and dirty data is generated.

  • Cause analysis:

    DataWorks' built-in csv reader ("fileFormat": "csv") has limitations when it handles delimiters that are composed of multiple characters, which leads to inaccurate column splitting of data rows.

  • Solution:

    • Codeless UI: Switch the file type from csv to text and explicitly specify your multi-character delimiter.

    • Code editor: Change "fileFormat": "csv" to "fileFormat": "text" and correctly set the delimiter: "fieldDelimiter":"<multi-delimiter>", "fieldDelimiterOrigin":"<multi-delimiter>".

Is there a limit on the number of files that can be read from OSS?

Offline synchronization itself does not limit the number of files that the OSS reader plugin can read. The main limitation on file reading comes from the resource consumption (CU) of the task itself. Reading too many files at once can easily lead to an out-of-memory error. Therefore, we recommend that you do not set the object parameter to `*` to prevent the OutOfMemoryError: Java heap space error.

When writing to OSS, how do I remove the random string that appears in the filename?

OSS Writer uses filenames to simulate a directory structure, and object names are subject to OSS naming rules: if you set `"object": "datax"`, the name of each written object starts with `datax` and a random string is appended as a suffix. The number of files is determined by the actual number of sharded tasks.

If you do not want the random UUID suffix, set `"writeSingleObject" : "true"`. For more information, see the `writeSingleObject` description in the OSS data source documentation.
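
A minimal sketch of the relevant OSS Writer parameters; other required settings, such as the data source, bucket, and field delimiter, are omitted and must stay as already configured.

#Write all data to a single object named datax instead of multiple randomly suffixed objects
"parameter": {
    "object": "datax",
    "writeSingleObject": "true"
}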

An error occurs when reading data from OSS: AccessDenied The bucket you access does not belong to you.

  • Cause:

    The AccessKey account configured for the data source does not have the required permissions on the bucket.

  • Solution:

    Grant the read permission on the bucket to the AccessKey account that is configured for the OSS data source.

Hive

An offline sync task to a local Hive instance fails with the error: Could not get block locations.

  • Cause analysis:

    The mapred.task.timeout parameter may be set to a value that is too short, which causes Hadoop to terminate the task and clean up the temporary directory. This results in the temporary data not being found.

  • Solution:

    In the data source settings for an offline sync task, if you select Read Data Based On Hive JDBC (supports Conditional Filtering) for the Read Hive Method, you can set the mapred.task.timeout parameter in the Session Configuration field. For example: mapred.task.timeout=600000.

DataHub

When writing to DataHub, how do I handle a write failure caused by exceeding the single-write data limit?

  • Error message:

    ERROR JobContainer - Exception when job runcom.alibaba.datax.common.exception.DataXException: Code:[DatahubWriter-04], Description:[Write data failed.]. - com.aliyun.datahub.exception.DatahubServiceException: Record count 12498 exceed max limit 10000 (Status Code: 413; Error Code: TooLargePayload; Request ID: 20201201004200a945df0bf8e11a42)

  • Possible cause:

    The error is caused because the amount of data submitted to DataHub by DataX in a single batch exceeds DataHub's limit. The configuration parameters that affect the amount of data submitted to DataHub are mainly:

    • maxCommitSize: The maximum size of the data buffer in megabytes (MB) before data is submitted to the destination in a batch. The default value is 1 MB (1,048,576 bytes).

    • batchSize: The number of data records that are accumulated in the DataX-On-Flume buffer. When the number of accumulated records reaches the specified `batchSize`, the records are submitted to the destination as a batch.

  • Solution:

    Reduce the maxCommitSize and batchSize parameters.
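
For example, a DataHub Writer parameter section with both values lowered might look like the following sketch. The data source and topic names are placeholders, the values are only examples, and maxCommitSize is assumed to be specified in bytes, as the 1,048,576 default suggests.

#Halve the default commit size and cap each batch at 1,024 records
"parameter": {
    "datasource": "my_datahub_source",
    "topic": "my_topic",
    "maxCommitSize": 524288,
    "batchSize": 1024
}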

LogHub

When reading from LogHub, a field with data is synchronized as empty

This plugin is case-sensitive for field names. Check the column configuration of the LogHub reader.

Data is missing when reading from LogHub

Data Integration uses the time the data entered LogHub. In the LogHub console, check whether the `receive_time` metadata field is within the time range configured for the task.

When reading from LogHub, the mapped fields are not as expected

If this occurs, manually edit the column in the interface.

Lindorm

When using the Lindorm bulk method to write data, is historical data replaced every time?

The logic is the same as the API write method. Data in the same row and column is overwritten, while other data remains unchanged.

Elasticsearch

How do I query all fields under an ES index?

You can use a curl command to obtain the ES index mapping, and then extract all fields from the mapping.

  • Query Shell command:

    //es7
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_mapping'
    //es6
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_mapping'
  • Obtain fields from the result:

    {
        "indexname": {
            "mappings": {
                "typename": {
                    "properties": {
                        "field1": {
                            "type": "text"
                        },
                        "field2": {
                            "type": "long"
                        },
                        "field3": {
                            "type": "double"
                        }
                    }
                }
            }
        }
    }

    In the returned result, the `properties` section contains all the fields and their property definitions for the index. The index above contains three fields: `field1`, `field2`, and `field3`.

When synchronizing data offline from ES to other data sources, the index name changes daily. How do I configure this?

You can add date scheduling variables to the index configuration. This allows the index string to be calculated based on different dates, which enables automatic changes to the Elasticsearch Reader's index name. The configuration process includes three steps: defining date variables, configuring the index variable, and publishing and executing the task.

  1. Define date variables: In the sync task's scheduling configuration, select Add Parameter to define date variables. In this example, `var1` is configured to represent the task execution time (current day), and `var2` represents the task's data timestamp (previous day).

  2. Configure the index variable: Switch the task to the code editor and configure the Elasticsearch Reader's index. The configuration format is `${variable_name}`.

  3. Publish and execute the task: After validation, submit and publish the task to the Operation Center, and run it using a recurring schedule or data backfill.

    1. Click the Run With Parameters button to run the task for validation. This action replaces the scheduling system parameters in the task configuration. After the task is complete, check the logs to verify that the synchronized index is as expected.

      Note

      When you run with parameters, directly enter the parameter values for testing.

    2. If the validation is successful, you can click Save and Submit to add the sync task to the production environment.

      For a standard project, click the Publish button to open the publishing center and publish the sync task to the production environment.

  4. Result: The following shows the script index configuration and the actual index used at run time.

    Script index configuration: "index": "esstress_1_${var1}_${var2}".

    Running index replaced as: esstress_1_20230106_20230105.

How does Elasticsearch Reader synchronize properties of Object or Nested fields? (e.g., synchronizing object.field1)

To synchronize object field properties, you must use the code editor. In the code editor, configure `multi` as follows, and configure the column using the `property.sub-property` format.

"multi":{
   "multi":true 
 }

You can refer to the following example for configuration:

#Example:
##Data in ES
"hits": [
    {
        "_index": "mutiltest_1",
        "_type": "_doc",
        "_id": "7XAOOoMB4GR_1Dmrrust",
        "_score": 1.0,
        "_source": {
            "level1": {
                "level2": [
                    {
                        "level3": "testlevel3_1"
                    },
                    {
                        "level3": "testlevel3_2"
                    }
                ]
            }
        }
    }
]
##Reader configuration
"parameter": {
  "column": [
      "level1",
      "level1.level2",
      "level1.level2[0]"
  ],
  "multi":{
        "multi":true
    }
}
##Writer result: 1 row with 3 columns. The column order is the same as the reader configuration.
COLUMN              VALUE
level1:             {"level2":[{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]}
level1.level2:      [{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]
level1.level2[0]:   {"level3":"testlevel3_1"}

After synchronizing a string type from ODPS to ES, the surrounding quotes are missing. How do I fix this? Can a JSON-type string from the source be synchronized as a NESTED object in ES?

  1. The extra double quotes around the characters are a display issue in the Kibana tool. The actual data does not have these surrounding double quotes. You can use a curl command or Postman to view the actual data. The curl command to obtain the data is as follows:

    //es7
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_search'
    //es6
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_search'

  2. You can configure the ES write field type as `nested` to synchronize JSON-type string data from ODPS to ES and store it in nested format. The following example synchronizes the `name` field to ES in nested format.

    • Sync configuration: Configure the type of `name` as nested.

    • Sync result: `name` is a nested object type.

If the source data is the string "[1,2,3,4,5]", how can I synchronize it to ES as an array?

There are two ways to configure writing to ES as an array type. You can choose the appropriate synchronization method based on the source data format.

  • Write to ES as an array type by parsing the source data in JSON format. For example, if the source data is `"[1,2,3,4,5]"`, set json_array=true in the column configuration so that the source data is parsed and written to the ES field as an array.

    • Codeless UI configuration: set json_array=true for the corresponding field in the column configuration.

    • Code editor configuration:

      "column":[
        {
          "name":"docs",
          "type":"keyword",
          "json_array":true
        }
      ]
  • Write to ES as an array type by parsing the source data with a delimiter. For example, if the source data is `"1,2,3,4,5"`, configure the delimiter `splitter=","` to parse and write it to the ES field as an array.

    • Limitations:

      • Only one type of delimiter is supported per task. The splitter is globally unique. You cannot configure different delimiters for multiple Array fields. For example, if the source columns are col1="1,2,3,4,5" , col2="6-7-8-9-10", you cannot configure a separate splitter for each column.

      • The splitter can be a regular expression. For example, if the source column value is `"6-,-7-,-8+,*9-,-10"`, you can configure `splitter:".,."`. This is supported in the codeless UI.

    • Codeless UI configuration: specify the splitter for the field. The splitter defaults to "-,-".

    • Code editor configuration:

      "parameter" : {
            "column": [
              {
                "name": "col1",
                "array": true,
                "type": "long"
              }
            ],
            "splitter":","
      }

When writing data to ES, each request is first sent without a username even though authentication is required, and the rejected request and its data are written to the audit log. This generates many audit logs daily. How do I handle this?

  • Cause analysis:

    HttpClient is pre-configured to first make a request without credentials each time a connection is established. After it receives the authorization requirement and determines the authentication method based on the response, it then makes a request with credentials. A connection needs to be established each time data is written to ES, so there is one request without credentials for each write. This causes each data write to be recorded in the audit log.

  • Solution:

    You need to add the "preemptiveAuth":true configuration in the code editor.

How do I synchronize data to ES as a Date type?

There are two ways to configure date writing. You can choose as needed.

  • Directly write the content of the field read by the reader to the ES Date field:

    • Configure `origin:true` to write the read content directly to ES.

    • Configure `"format"` to specify the format attribute for the field when you create the mapping using ES write.

      "parameter" : {
          "column": [
              {
                  "name": "col_date",
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss",
                  "origin": true
              }
                ]
      }
  • Time zone conversion: If you need Data Integration to perform time zone conversion, add the Timezone parameter.

    "parameter" : {
        "column": [
            {
                "name": "col_date",
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss",
                "Timezone": "UTC"
            }
              ]
    }

Elasticsearch Writer fails to write data due to a specified external version. How do I handle this?

  • Cause: The column configuration specifies `"type":"version"`, but Elasticsearch Writer currently does not support specifying an external version:

        "column":[
                                {
                                    "name":"id",
                                    "type":"version"
                                },
      ]
  • Solution:

    You need to remove the `"type":"version"` configuration. Elasticsearch Writer does not currently support specifying an external version.

An offline sync task reading from Elasticsearch fails with the error: ERROR ESReaderUtil - ES_MISSING_DATE_FORMAT, Unknown date value. please add "dataFormat". sample value:

  • Cause analysis:

    Elasticsearch Reader cannot parse the date format of a date type field because the mapping for the corresponding date field in the ES data source does not have a format configured.

  • Solution:

    • Configure the `dateFormat` parameter to match the formats of the ES date fields, with multiple formats separated by "||". The value must cover all formats used by the date fields. For example:

      "parameter" : {
            "column": [
           			"dateCol1",
              	"dateCol2",
                "otherCol"
            ],
           "dateFormat" : "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss",
      }
    • Alternatively, configure a format in the mapping for every date field in the ES index.

An offline sync task reading from Elasticsearch fails with the error: com.alibaba.datax.common.exception.DataXException: Code:[Common-00].

  • Cause analysis:

    Due to fastjson keyword restrictions, there may be keywords like `$ref` in the index or column.

  • Solution:

    Elasticsearch Reader does not currently support synchronizing indexes with the keyword `$ref` in their fields. For more information, see Elasticsearch Reader.

An offline sync task writing to Elasticsearch fails with the error: version_conflict_engine_exception.

  • Cause analysis:

    The optimistic locking mechanism of ES was triggered: the current version number of the document was xxx, but the update command carried a different version number, which caused a version conflict. This typically happens because another process was deleting index data while the data was being updated.

  • Solution:

    1. Check whether index data was being deleted while the task was running.

    2. Change the task synchronization method from Update to Index.
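    If the task is configured in the code editor, the write mode is switched in the Elasticsearch Writer parameters. The sketch below assumes the mode is exposed through an actionType parameter; verify the exact parameter name against the Elasticsearch Writer documentation.

      # Assumption: the write mode parameter is named actionType; the index name is a placeholder
      "parameter": {
          "index": "your_index",
          "actionType": "index"
      }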

An offline sync task writing to Elasticsearch fails with the error: illegal_argument_exception.

  • Cause analysis:

    When you configure advanced properties such as `similarity` or `properties` for a column field, the plugin can recognize them only if they are placed in `other_params`.

  • Solution:

    Configure other_params in the Column, and add `similarity` within other_params, as shown below:

    {"name":"dim2_name",...,"other_params":{"similarity":"len_similarity"}}

An offline sync task from an ODPS Array field to Elasticsearch fails with the error: dense_vector

  • Cause analysis:

    Currently, offline synchronization to Elasticsearch does not support the dense_vector type. Only the following types are supported:

    ID,PARENT,ROUTING,VERSION,STRING,TEXT,KEYWORD,LONG,
    INTEGER,SHORT,BYTE,DOUBLE,FLOAT,DATE,BOOLEAN,BINARY,
    INTEGER_RANGE,FLOAT_RANGE,LONG_RANGE,DOUBLE_RANGE,DATE_RANGE,
    GEO_POINT,GEO_SHAPE,IP,IP_RANGE,COMPLETION,TOKEN_COUNT,OBJECT,NESTED;
  • Solution:

    To handle types not supported by Elasticsearch Writer:

    • Do not use Elasticsearch Writer to create the index mapping. Create the mapping yourself in Elasticsearch instead.

    • Change the corresponding type to `NESTED`.

    • Modify the configuration to: dynamic = true and cleanup=false.

Why don't the settings configured in Elasticsearch Writer take effect when creating an index?

  • Cause:

    #Incorrect configuration
    "settings": {
      "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }
    #Correct configuration
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  • Solution:

    Settings take effect only when an index is created, which happens in two cases: the index does not exist, or cleanup=true is configured. When the sync task creates the index, do not nest the settings under an `index` key; use the flat form shown in the correct configuration above.

Why does a property of the nested type in my self-built index change to the keyword type after automatic generation? (Automatic generation refers to executing a sync task with cleanup=true configured.)

#Original mappings
{
  "name":"box_label_ret",
  "properties":{
    "box_id":{
      "type":"keyword"
    }
  }
}
#After cleanup=true rebuild, it becomes
{
  "box_label_ret": {
    "properties": {
      "box_id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
  • Cause analysis:

    For nested types, Elasticsearch Writer writes only the top-level mappings and lets ES infer the types of the properties underneath. Changing the property type to `text` and adding a `keyword` sub-field is ES's default adaptive (dynamic mapping) behavior and does not affect use. If you are concerned about this mapping format, see Data Integration sync task capabilities.

  • Solution:

    Before synchronization, create the expected ES index mappings, then set the ES sync task's cleanup parameter to false and execute the sync task.
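    For example, a minimal sketch of pre-creating the index with the expected nested mapping; the index name is a placeholder and the field names follow the example above. With cleanup=false, the sync task then writes into this existing mapping instead of regenerating it.

      PUT /my_index
      {
        "mappings": {
          "properties": {
            "box_label_ret": {
              "type": "nested",
              "properties": {
                "box_id": { "type": "keyword" }
              }
            }
          }
        }
      }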

Kafka

When reading from Kafka, endDateTime is configured to specify the data range, but data beyond this time is found in the destination

Kafka Reader reads data in batches. If a batch of data contains records that are beyond the endDateTime, the synchronization stops, but this data that exceeds the endDateTime is also written to the destination data source.

  • You can use the `skipExceedRecord` configuration item to specify whether to synchronize the excess records. For more information, see Kafka Reader. We recommend that you do not skip the excess records, because skipping them may cause data loss. A configuration sketch follows this list.

  • You can configure Kafka's `max.poll.records` parameter to specify the amount of data to pull at one time. Combined with the concurrency, this can control the amount of potentially excess data. The amount of excess data is less than `max.poll.records` × concurrency.
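A minimal Kafka Reader parameter sketch that combines the two settings above; the server, topic, consumer group, and time values are placeholders, and it is assumed that consumer properties such as max.poll.records are passed through the kafkaConfig map:

    # Sketch only: connection and time values are placeholders
    "parameter": {
        "server": "your-kafka-broker:9092",
        "topic": "your_topic",
        "beginDateTime": "20231201000000",
        "endDateTime": "20231201120000",
        "skipExceedRecord": "false",
        "kafkaConfig": {
            "group.id": "your_consumer_group",
            "max.poll.records": "200"
        }
    }

With skipExceedRecord set to "false", the excess records are written rather than dropped, and their amount stays below max.poll.records × concurrency.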

Why does a task with a small amount of data in Kafka keep running for a long time without reading data or finishing?

  • Cause analysis:

    A small amount of data or uneven data distribution causes some Kafka partitions to have no new data, or the new data has not reached the specified end offset for reading. Because the task's exit condition is that all partitions must reach the specified end offset, these "idle" partitions cannot meet the condition, which blocks the normal completion of the entire task.

  • Solution:

    In this case, set the Synchronization End Policy to: Stop if no new data is read for 1 minute (in the code editor, set the stopWhenPollEmpty and stopWhenReachEndOffset parameters to true). This allows the task to exit as soon as it has read up to the latest offsets in all partitions, instead of idling. However, note that if records with timestamps earlier than the configured end offset are written to a partition after the task has finished, those records are not consumed.
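    A minimal sketch of this end policy in the code editor, shown as a Kafka Reader parameter fragment; all other required settings are omitted:

      # Sketch only: other Kafka Reader settings omitted
      "parameter": {
          "stopWhenPollEmpty": "true",
          "stopWhenReachEndOffset": "true"
      }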

RestAPI

RestAPI Writer reports an error: The JSON string found through path:[] is not an array type

RestAPI Writer provides two write modes. When you synchronize multiple data records, set `dataMode` to `multiData`. For more information, see RestAPI Writer. You also need to add the parameter `dataPath:"data.list"` in the RestAPI Writer script. A configuration sketch follows the note below.

Important

You do not need to add the "data.list" prefix when you configure Column.
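A minimal RestAPI Writer parameter sketch for this case; the url is a placeholder and the column layout is illustrative:

    # Sketch only: url and column names are placeholders
    "parameter": {
        "url": "http://your-service/api/receive",
        "dataMode": "multiData",
        "dataPath": "data.list",
        "column": [
            {"name": "id"},
            {"name": "name"}
        ]
    }

Note that the column names are written without the data.list prefix, as stated above.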

OTS Writer configuration

How do I configure OTS Writer to write data to a target table that contains an auto-increment primary key column?

  1. The OTS Writer configuration must include the following two lines:

    "newVersion": "true",
    "enableAutoIncrement": "true",
  2. The OTS Writer does not need to be configured with the name of the auto-increment primary key column.

  3. The number of primaryKey entries + the number of column entries configured in OTS Writer must equal the number of columns in the upstream OTS Reader data.
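Putting the three points together, a rough OTS Writer parameter sketch; the table, primary key, and column names are placeholders, and the auto-increment primary key column is deliberately omitted:

    # Sketch only: the table has primary keys (pk1, auto_id); auto_id is auto-increment and is not listed
    "parameter": {
        "table": "your_table",
        "newVersion": "true",
        "enableAutoIncrement": "true",
        "primaryKey": [
            {"name": "pk1", "type": "string"}
        ],
        "column": [
            {"name": "col1", "type": "string"},
            {"name": "col2", "type": "int"}
        ]
    }

In this sketch, the upstream reader must output exactly three columns: one matching the primaryKey entry and two matching the column entries.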

Time series model configuration

In a time series model configuration, what do the _tag and is_timeseries_tag fields mean?

Example: A data record has three tags: `[phone=Xiaomi, memory=8G, camera=Leica]`.

  • Data exporting example (OTS Reader)

    • If you want to merge the above tags into a single column for export, configure as follows:

      "column": [
            {
              "name": "_tags",
            }
          ],

      DataWorks exports the tags as a single column of data, in the following format:

      ["phone=xiaomi","camera=LEICA","RAM=8G"]
    • If you want to export the phone and camera tags so that each tag is in a separate column, use the following configuration:

      "column": [
            {
              "name": "phone",
              "is_timeseries_tag":"true",
            },
            {
              "name": "camera",
              "is_timeseries_tag":"true",
            }
          ],

      The following two columns of data are obtained:

      xiaomi, LEICA
  • Data import example (OTS Writer)

    Suppose the upstream data source (Reader) has two columns of data:

    • One column of data is: ["phone=xiaomi","camera=LEICA","RAM=8G"].

    • The other column of data is: 6499.

    If you want to write both of these columns into the tag field, configure as follows:

    "column": [
          {
            "name": "_tags",
          },
          {
            "name": "price",
            "is_timeseries_tag":"true",
          },
        ],
    • The first column configuration imports ["phone=xiaomi","camera=LEICA","RAM=8G"] as a whole into the tag field.

    • The second column configuration imports price=6499 separately into the tag field.

Custom table names

How do I customize table names for offline sync tasks?

If your table names follow a regular pattern, such as orders_20170310, orders_20170311, and orders_20170312, are distinguished by day, and have consistent table structures, you can combine scheduling parameters with code editor configuration to customize table names. This lets you automatically read the previous day's table data from the source database every morning.

For example, if today is March 15, 2017, the task automatically reads data from the orders_20170314 table and imports it.

In the code editor, change the source table name to a variable, for example, orders_${tablename}. Because the table names are distinguished by day and you need to read the previous day's data each day, assign the variable as tablename=${yyyymmdd} in the task's parameter configuration. A code editor sketch follows the note below.

Note

For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
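A minimal code editor sketch of this setup, assuming a MySQL source; the data source and column names are placeholders:

    # Sketch only: reader parameter fragment with a variable table name
    "parameter": {
        "connection": [
            {
                "datasource": "your_mysql_datasource",
                "table": ["orders_${tablename}"]
            }
        ],
        "column": ["order_id", "amount"]
    }

In the scheduling parameters, assign tablename=${yyyymmdd}; each daily instance then resolves orders_${tablename} to the previous day's table, for example orders_20170314 for a run on March 15, 2017.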

Table modification and column addition

How do I handle adding or modifying columns in the source table of an offline sync task?

Go to the sync task configuration page and modify the field mapping to update the changed fields in the task configuration. After that, you need to resubmit and execute the task for the changes to take effect.

Task configuration issues

When configuring an offline sync node, I cannot view all tables. How do I handle this?

When you configure an offline sync node, the Select Source area displays only the first 25 tables in the selected data source by default. If there are more tables, you can search by table name or use the code editor for development.

Table/column name keywords

How do I handle sync task failures caused by keywords in table or column names?

  • Cause: The column contains a reserved word, or the column configuration contains a field that starts with a number.

  • Solution: Switch the Data Integration sync task to the code editor and escape the special fields in the column configuration. For more information about how to configure a task in the code editor, see Configure a sync node in the code editor.

    • The escape character for MySQL is `keyword`.

    • The escape character for Oracle and PostgreSQL is "keyword".

    • The escape character for SQL Server is [keyword].

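    Before the view-based workaround below, you can often resolve the issue directly in the code editor by escaping the offending names in the column configuration. A minimal MySQL fragment as a sketch; `table` is a reserved word, and `123col` stands for a hypothetical column name that starts with a number:

      # Sketch only: 123col is a hypothetical column name used for illustration
      "column": [
          "`table`",
          "`123col`",
          "msg"
      ]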

  • Take a MySQL data source as an example:

    1. Run the following statement to create a new table named `aliyun`: create table aliyun (`table` int, msg varchar(10));

    2. Run the following statement to create a view and give an alias to the `table` column: create view v_aliyun as select `table` as col1, msg as col2 from aliyun;

      Note
      • `table` is a MySQL keyword. The code generated during data synchronization causes an error. Therefore, you need to create a view to give an alias to the `table` column.

      • Do not use keywords as table column names.

    3. By running the above statement, you can assign an alias to the column that has a keyword. When you configure the sync task, select the `v_aliyun` view instead of the `aliyun` table.

Field mapping

An offline task fails with the error: plugin xx does not specify column

This error may occur because the field mapping of the sync task is not configured correctly, or the plugin's column is not configured correctly.

  1. Check whether field mapping is configured.

  2. Check whether the plugin's column is configured correctly.

For an unstructured data source, fields cannot be mapped when I click Data Preview. How do I handle this?

  • Symptom:

    When you click Data Preview, a message is displayed indicating that a field exceeds the maximum byte size.

  • Cause: To avoid an out-of-memory (OOM) error, the data source service checks the length of fields when it processes a data preview request. If a single column exceeds 1,000 bytes, this message appears. It does not affect the normal operation of the task; you can ignore it and run the offline sync task directly.

    Note

    If the file exists and connectivity is normal, other reasons for data preview failure include the following:

    • The byte size of a single row in the file exceeds the limit of 10 MB. In this case, no data is displayed, similar to the message above.

    • The number of columns in a single row of the file exceeds the limit of 1,000. In this case, only the first 1,000 columns are displayed, and a message is shown at the 1001st column.

Modify TTL

For a synchronized data table, can the TTL only be modified using the ALTER method?

The time-to-live (TTL) is set on the table itself; the sync task configuration provides no option for it, so you need to modify the TTL on the table, for example by using ALTER.

Function aggregation

When synchronizing via API, is it supported to use source-side functions (e.g., MaxCompute) for aggregation? For example, if the source table has columns a and b as the primary key for Lindorm

API-based synchronization does not support invoking functions on the source side (for example, MaxCompute functions) during the sync. Process the data with the functions on the source side first, and then import the processed result.