All Products
Search
Document Center

DataWorks:Batch synchronization FAQ

Last Updated:Jun 30, 2026

This topic addresses common questions about batch synchronization.

Overview

You can find common issues and their solutions by matching keywords.

Category

Keyword

Related Topic

Common O&M issues for batch synchronization tasks

Network communication issues

Why does a data source pass the connectivity test but an offline sync task fails with a data source connection error?

Switch resource groups

How do I switch the resource group for an offline sync task?

Dirty data

Run timeout

How do I troubleshoot long-running offline sync tasks?

Slow sync caused by a missing index in the WHERE condition of a data sync task

Whether default values of source tables are retained

Are default values and NOT NULL constraints retained in the destination table created by Data Integration?

Split key

Can a composite primary key be used as a split key for offline sync tasks?

Data loss

Data inconsistency between the destination table and source table after data sync

Non-plugin error causes and solutions

Dirty data

How do I handle dirty data errors caused by encoding format or garbled characters?

SSRF attacks

How do I handle the error "Task have SSRF attacks"?

Network communication issues

Offline sync task intermittently succeeds or fails

Table/column name keywords

How do I handle sync task failures caused by reserved keywords in table or column names?

Adding columns to a table

How do I handle column additions in the source table for offline sync tasks?

Date writing

How do I preserve milliseconds or specify a custom date-time format when writing date-time data to text?

Plugin-specific error causes and solutions

MongoDB

OSS

Is there a file count limit when reading OSS files?

DataHub

How do I handle write failures caused by exceeding the data limit when writing to DataHub?

Lindorm

Does writing data using the Lindorm bulk method always replace historical data?

Elasticsearch

How do I query all fields in an Elasticsearch index?

OTS Writer configuration

How do I configure OTS Writer to write data to a destination table with auto-increment primary key columns?

Time series model configuration

How do I understand the _tag and is_timeseries_tag fields in time series model configuration?

Batch synchronization scenarios and solutions

Custom table names

How do I customize table names for offline sync tasks?

MaxCompute

Task configuration issues

How do I handle the issue of not being able to view all tables when configuring an offline sync node?

LogHub

Kafka

OSS

MySQL

Modify TTL

Can the TTL of a synced data table only be modified by using the ALTER statement?

Function aggregation

Does API-based sync support using source-side functions (such as MaxCompute functions) for aggregation?

Elasticsearch

Field mapping

How do I handle field mapping issues when data preview is unavailable for unstructured data sources?

Error messages and solutions

Resource configuration issues

OSS

Error when reading OSS data: AccessDenied The bucket you access does not belong to you

Redis

Error when writing to Redis in hash mode: Code:[RedisWriter-04] source column number is invalid

PostgreSQL

Error when reading PostgreSQL data: FATAL: terminating connection due to conflict with recovery

MySQL

Instance run conflicts

Offline task error: Duplicate entry 'xxx' for key 'uk_uk_op'

Network communication issues

Offline sync task with MySQL data source error: Communications link failure

Field mapping

Offline task error: plugin xx does not specify column

MaxCompute

RestAPI

RestAPI Writer error: The JSON string found by path is not an array type

RDS

Error when the offline sync source is Amazon RDS: Host is blocked

MongoDB

Elasticsearch

Hive

Error when syncing data to local Hive offline: Could not get block locations

Run timeout

Offline sync task with MongoDB source error: MongoExecutionTimeoutException: operation exceeded time limit

Network connectivity

Why does the data source connectivity test succeed, but the batch synchronization task fails with a data source connection error?

  • If the connectivity test previously succeeded, test again to confirm that the resource group and database are currently connected (and that no changes have been made on the database side).

  • Check whether the resource group that passed the connectivity test is the same one used to run the task.

    Check the resource group used by the task:

    • If the task runs on the default resource group, the logs contain the following information: running in Pipeline[basecommon_ group_xxxxxxxxx]

    • If the task runs on an exclusive resource group for Data Integration, the logs contain the following information: running in Pipeline[basecommon_S_res_group_xxx]

    • If the task runs on a serverless resource group, the logs contain the following information: running in Pipeline[basecommon_Serverless_res_group_xxx]

  • If the task occasionally fails during early-morning scheduling but succeeds after a rerun, check the database load at the time the error occurred.

Batch synchronization task intermittently succeeds and fails

If a batch synchronization task intermittently fails, the cause may be an incomplete allowlist configuration. Check whether the database allowlist is fully configured.

When using an exclusive resource group for Data Integration:

  • If you previously added the elastic network interface (ENI) IP addresses of the exclusive resource group for Data Integration to the data source allowlist, and the resource group has since been scaled out, update the data source allowlist to include the ENI IP addresses of the scaled-out resource group.

  • To avoid having to update the allowlist every time the resource group scales out, we recommend that you add the CIDR block of the vSwitch associated with the exclusive resource group for Data Integration as the database allowlist. For more information, see Add an allowlist.

When using a serverless resource group: refer to Network connectivity of a serverless resource group to check the resource group allowlist configuration and ensure the network is properly configured.

If the allowlist is correctly configured, check whether the database load is too high, which may cause connections to be interrupted.

Resource settings

Batch synchronization task fails with error: [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots.

  • Possible cause:

    The concurrency is set too high, resulting in insufficient resources.

  • Solution:

    Reduce the concurrency of the batch synchronization task.

Batch synchronization task fails with error: OutOfMemoryError: Java heap space

If the preceding error occurs, perform the following steps:

  1. If the plugin configuration supports parameters such as batchsize or maxfilesize, reduce the corresponding values.

    You can check whether each plugin supports the preceding parameters. Go to the Supported data sources and readers/writers topic and click the corresponding plugin to view parameter details.

  2. Reduce the concurrency.

  3. If you are synchronizing files, such as OSS files, reduce the number of files to read.

  4. In the Running Resources section of the task configuration, increase the Resource Usage (CU) value appropriately. Set the CU value carefully to avoid affecting other running tasks.

Instance run conflicts

Batch synchronization task fails with error: Duplicate entry 'xxx' for key 'uk_uk_op'

  • Error message: Error updating database. Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'cfc68cd0048101467588e97e83ffd7a8-0' for key 'uk_uk_op'.

  • Possible cause: Data Integration does not allow different instances of the same node (that is, synchronization tasks with the same JSON configuration) to run at the same time. For example, if a synchronization task runs on a 5-minute schedule, and upstream delays cause both the 00:00 instance and the 00:05 instance to be triggered at 00:05, one of the instances cannot be started. This can also happen when you backfill data or rerun an instance while the task instance is still running.

  • Solution: Stagger instance run times. For tasks scheduled at hourly or minute-level intervals, we recommend that you set self-dependency so that the current instance starts only after the instance from the previous cycle completes. For configuration in legacy Data Studio, see Self-dependency. For configuration in the new Data Studio, see Configure self-dependency.

Run timeout

Batch synchronization task with MongoDB as the source fails with error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Error details: During a data synchronization task, the task fails with the following error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Possible cause: The full data pull is too large.

  • Solution:

    • Increase the concurrency.

    • Reduce the BatchSize.

    • Add the cursorTimeoutInMs configuration in the Reader parameter section, and set it to a large value, such as 3600000 ms.

Batch synchronization task with MySQL as the data source fails with connection timeout error: Communications link failure

  • Read error

    • Symptom:

      When reading data, the following error occurs: Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago. The last packet sent successfully to the server was 7,200,100 milliseconds ago. - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

    • Possible cause:

      The database executes SQL queries slowly, causing a MySQL read timeout.

    • Solution:

      • Check whether a where filter condition is configured and ensure that the filter columns are indexed.

      • Check whether the source table contains too much data. If so, split the task into multiple tasks.

      • Check the logs to find the SQL statement that caused the blocking and consult the database administrator to resolve the issue.

  • Write error

    • Symptom:

      When writing data, the following error occurs: Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago. The last packet sent successfully to the server was 12,013 milliseconds ago. More...

    • Possible cause:

      A slow query causes a SocketTimeout. The default SocketTimeout for TDDL connections is 12 seconds. If a SQL statement takes longer than 12 seconds to execute on MySQL, a 4614 error is reported. This error may occur occasionally when the data volume is large or the server is busy.

    • Solution:

      • Wait until the database is stable and then rerun the synchronization task.

      • Contact the database administrator to adjust the timeout value.

How do I troubleshoot a batch synchronization task that takes a long time to run?

Possible cause 1: Execution takes too long

  • Pre-SQL or post-SQL statements (such as preSql and postSql) take too long to execute in the database, causing the task to run slowly.

  • The split key is not properly configured, causing the task to run slowly.

    Batch synchronization uses the split key (splitPk) to shard data and starts concurrent tasks for data synchronization to improve efficiency. (Check the documentation for each specific plugin to determine whether a split key needs to be configured.)

Solution 1:

  • If pre-SQL or post-SQL statements are configured, use indexed columns for data filtering.

  • If the split key is supported, configure it properly. The following example uses the MySQL Reader plugin split key configuration:

    • We recommend that you use the primary key of the table as the splitPk because primary keys are typically evenly distributed, which helps avoid data hotspots in the resulting shards.

    • Currently, splitPk supports only integer-based data sharding and does not support string, floating-point, date, or other types. If you specify a non-supported type, single-channel synchronization is used.

    • If splitPk is left empty or not specified, data synchronization uses a single channel to synchronize the table data.

Possible cause 2: Waiting for Data Integration task execution resources

Solution 2: If the logs show a prolonged WAIT status, the exclusive resource group for Data Integration used by the current task does not have enough available concurrency to run the task. For details about the cause and solution, see Troubleshoot resource group concurrency issues.

Note

Because a batch synchronization task is dispatched from a scheduling resource group to a Data Integration execution resource group, a single batch synchronization task consumes one scheduling resource. If a batch synchronization task runs for an extended period without releasing resources, it may block not only other batch synchronization tasks but also other types of scheduled tasks.

What to do when a data synchronization task is slowed by a full table scan due to an unindexed WHERE clause?

  • Scenario example

    The executed SQL is as follows:

    SELECT bid,inviter,uid,createTime FROM `relatives` WHERE createTime>='2016-10-2300:00:00' AND reateTime<'2016-10-24 00:00:00';

    Execution started at 2016-10-25 11:01:24.875 and results started to return at 2016-10-25 11:11:05.489. The synchronization program waited for the database to return the SQL query results, and MaxCompute had to wait a long time before execution could proceed.

  • Root cause analysis

    The createTime column in the WHERE clause is not indexed, causing a full table scan.

  • Solution

    We recommend that the where clause uses indexed columns to improve performance. You can also add indexes as needed.

Switch resource group

How do I switch the execution resource group for a batch synchronization task?

Legacy Data Studio:

You can modify the resource group used for debugging on the batch synchronization task details page in DataStudio. You can also change the Data Integration task execution resource group used during scheduling in Operation Center. For more information, see Switch the Data Integration resource group.

New Data Studio:

You can modify the resource group used for debugging Data Integration tasks in DataStudio. You can also change the Data Integration task execution resource group used during scheduling in Operation Center. For more information, see Switch the Data Integration resource group.

Dirty data

How do I troubleshoot and locate dirty data?

Dirty data definition: If an exception occurs when writing a single record to the destination data source, that record is classified as dirty data. Any record that fails to be written is considered dirty data.

Impact of dirty data: Dirty data is not written to the destination. You can control whether dirty data is allowed and specify the maximum number of dirty data records allowed. By default, Data Integration allows dirty data. You can specify the dirty data threshold when configuring a synchronization task. For more information, see Configure channel control in wizard mode.

  • If the task allows dirty data: The task continues to run when dirty data is generated, but the dirty data is discarded and not written to the destination.

  • Controlling the number of allowed dirty data records:

    • If the allowed dirty data count is set to 0, the task fails and exits when any dirty data is generated.

    • If the allowed dirty data count is set to x, the task fails and exits when the dirty data count exceeds x. If the dirty data count is less than x, the task continues to run, but the dirty data is discarded and not written to the destination.

Dirty data scenario analysis:

  • Scenario 1:

    • Error message: {"message":"Dirty data encountered when writing to the ODPS destination table: An error occurred in the data of field [3]. Please check the data and make corrections, or you can increase the threshold to ignore this record.","record":[{"byteSize":0,"index":0,"type":"DATE"},{"byteSize":0,"index":1,"type":"DATE"},{"byteSize":1,"index":2,"rawData":0,"type":"LONG"},{"byteSize":0,"index":3,"type":"STRING"},{"byteSize":1,"index":4,"rawData":0,"type":"LONG"},{"byteSize":0,"index":5,"type":"STRING"},{"byteSize":0,"index":6,"type":"STRING"}]}.

    • How to handle: The log shows the dirty data column. The third column is abnormal.

      • Dirty data is reported by the writer. Check the DDL statement for the destination table. The column size specified for the ODPS table is smaller than the actual data size of the corresponding MySQL column.

      • Data synchronization principle: Data from the source data source must be writable to the destination data source (source and destination types must match, and column size definitions must match). Specifically, the source data type must match the destination data type. For example, VARCHAR data from the source cannot be written to an INT column on the destination. The destination column size must be large enough to accommodate the actual data size of the mapped source column. Source data of types such as LONG, VARCHAR, and DOUBLE can be stored in broader types like string or text on the destination.

      • If the dirty data error message is unclear, copy the entire dirty data record from the log, examine the data, and compare it with the destination data types to identify which column or columns are non-compliant.

      For example:

      {"byteSize":28,"index":25,"rawData":"ohOM71vdGKqXOqtmtriUs5QqJsf4","type":"STRING"}

      byteSize: byte count; index: 25, the 26th column; rawData: the actual value; type: data type.

  • Scenario 2:

    • Error message: DataX reports dirty data when reading null values from MySQL.

    • How to handle: Check whether the data type of the source column with null values matches the mapped destination column type. A type mismatch causes an error. For example, writing a string-type null to an int-type destination column results in an error.

How do I view dirty data?

You can view the task logs and click Detail log url in the logs to obtain the detailed runtime log and dirty data information.View logs

If the amount of dirty data exceeds the limit during a batch synchronization task, is the already synchronized data retained?

The task accumulates the count of dirty data records during execution. Once the count exceeds the configured dirty data threshold, the task immediately terminates.

  • Data retention: Data that was successfully written to the destination before the task terminates is retained. No rollback is performed.

  • Zero-tolerance policy: When the dirty data threshold is set to 0, the system adopts a zero-tolerance policy. This means that the task fails and stops immediately upon detecting the first dirty data record.

How do I handle dirty data errors caused by encoding format settings or garbled characters?

  • Error message:

    If the data contains emoji characters, dirty data errors may occur during synchronization: [13350975-0-0-writer] ERROR StdoutPluginCollector - Dirty data {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"}],"type":"writer"} .

  • Possible cause:

    • The database encoding is not set to utf8mb4, which causes errors when synchronizing emoji characters.

    • The source data itself contains garbled characters.

    • The encoding of the database and the client is inconsistent.

    • The browser encoding is different, causing preview failures or garbled characters.

  • Solution:

    Choose the appropriate solution based on the cause of the garbled characters:

    • If the original data contains garbled characters, fix the data before running the synchronization task.

    • If the encoding formats of the database and client are inconsistent, modify the encoding format first.

    • If the browser encoding is inconsistent with the database or client encoding, unify the encoding formats before previewing data.

    You can try the following:

    1. For data sources added in JDBC format, modify utf8mb4 as follows: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    2. For data sources added by instance ID, append the following to the database name: database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    3. Modify the database encoding format to utf8mb4. For example, modify the RDS database encoding format on the RDS console.

      Note

      Command to set the RDS data source encoding format: set names utf8mb4. Command to check the RDS database encoding format: show variables like 'char%'.

Retention of default values

Does Data Integration preserve properties, such as default values and not-null constraints, when it creates a destination table?

When creating a destination table, DataWorks preserves only the column names, data types, and comments from the source table. It does not preserve default values, constraints (including not-null constraints and indexes).

Split key

Can a composite primary key be used as a split key in a batch synchronization task?

Batch synchronization tasks do not support using a composite primary key as a split key.

Missing data

Data synchronization completes but the destination table data is inconsistent with the source table data

If data quality issues occur after data synchronization, refer to Troubleshoot data quality issues after synchronization for detailed troubleshooting.

SSRF attacks

Task has SSRF attacks Task have SSRF attacks How do I handle this?

Q: How do I handle the error "Task have SSRF attacks"?

Cause: To ensure cloud security, DataWorks prohibits tasks from accessing internal cloud network addresses through public IP addresses. When a URL in the plugin configuration (such as HTTP Reader) points to an internal IP address or VPC domain name, this security check is triggered.

Correct approach:

Solution: For tasks that access internal data sources, stop using the shared resource group and switch to a secure serverless resource group (recommended) or an exclusive resource group for Data Integration.

Date writing

How do I preserve milliseconds or specify a custom date-time format when writing date-time data to text?

Switch the synchronization task to script mode and add the following configuration in the setting section of the task configuration page:

"common": {
  "column": {
    "dateFormat": "yyyyMMdd",
    "datetimeFormatInNanos": "yyyyMMdd HH:mm:ss.SSS"
  }
}

image.png

Where:

  • dateFormat specifies the date format used when converting source DATE (without time) type data to text.

  • datetimeFormatInNanos specifies the date format used when converting source DATETIME/TIMESTAMP (with time) type data to text. You can specify precision up to milliseconds.

MaxCompute

Notes for adding a row or column in the column mapping when reading MaxCompute (ODPS) table data

  1. You can enter constants. The values must be enclosed in single quotation marks, such as 'abc' and '123'.

  2. You can use scheduling parameters, such as '${bizdate}'. For information about how to use scheduling parameters, see Configure scheduling parameters.

  3. You can enter the partition columns to synchronize, such as pt.

  4. If the entered value cannot be parsed, the type is displayed as 'Custom'.

  5. ODPS functions are not supported.

  6. If a manually added column is displayed as Custom (for example, a MaxCompute partition column or a LogHub column not shown in the data preview), it does not affect actual task execution.

How do I synchronize partition columns when reading MaxCompute (ODPS) table data?

In the column mapping list, click Add or Create Field under the source table columns, enter the partition column name (such as pt), and configure the mapping to the destination table column.

image

How do I synchronize data from multiple partitions when reading MaxCompute (ODPS) table data?

Specify the partition information for the data to read.

  • ODPS partition configuration supports Linux shell wildcards: * matches zero or more characters, and ? matches any single character.

  • By default, the specified partition must exist. If the partition does not exist, the task fails. If you want the task to succeed even when the partition does not exist, set When partitions do not exist, to: ignore non-existing partitions and execute the task normally. Alternatively, switch to script mode and add "successOnNoPartition": true in the ODPS Parameter section.

For example, if the partitioned table test has four partitions: pt=1,ds=hangzhou, pt=1,ds=shanghai, pt=2,ds=hangzhou, and pt=2,ds=beijing, the configurations for reading different partitions are as follows:

  • To read data from the pt=1,ds=hangzhou partition, set the partition information to "partition":"pt=1,ds=hangzhou".

  • To read data from all partitions under pt=1, set the partition information to "partition":"pt=1,ds=*".

  • To read data from all partitions of the test table, set the partition information to "partition":"pt=*,ds=*".

You can also set conditions for retrieving partition data based on your requirements (the following operations require script mode):

  • To specify the maximum partition, add the following configuration: /*query*/ ds=(select MAX(ds) from DataXODPSReaderPPR).

  • To filter by condition, add the relevant condition with the /*query*/ pt+expression configuration. For example, /*query*/ pt>=20170101 and pt<20170110 retrieves all data from the pt partition from 20170101 (inclusive) to 20170110 (exclusive).

Note

/*query*/ indicates that the content following it is recognized as a WHERE condition.

How to implement column filtering, reordering, and null filling for MaxCompute

By configuring MaxCompute Writer, you can implement column filtering, reordering, and null filling operations that MaxCompute itself does not support. For example, to import all columns, configure "column": ["*"].

If the MaxCompute table has three columns a, b, and c, and you want to synchronize only columns c and b, configure the column list as "column": ["c","b"]. This means the first and second columns from the Reader are imported into the c and b columns of the MaxCompute table, and the newly inserted column a in the MaxCompute table is set to null.

Handling MaxCompute column configuration errors

To ensure data write reliability and avoid data quality issues caused by extra column data loss, MaxCompute Writer reports an error if extra columns are written. For example, if the MaxCompute table has columns a, b, and c, and MaxCompute Writer attempts to write more than three columns, it reports an error.

MaxCompute partition configuration notes

MaxCompute Writer supports writing only to the last-level partition and does not support partition routing based on a column. If a table has three levels of partitions, you must specify the exact third-level partition in the partition configuration. For example, to write data to the third-level partition, configure it as pt=20150101, type=1, biz=2. You cannot configure it as pt=20150101, type=1 or pt=20150101.

MaxCompute task rerun and failover

MaxCompute Writer ensures write idempotency by configuring "truncate": true. When a write fails and is rerun, MaxCompute Writer clears the previous data and imports new data, ensuring data consistency after each rerun. If the task is interrupted due to other exceptions during execution, data atomicity is not guaranteed. Data is not rolled back or automatically rerun. You need to leverage the idempotency feature to rerun the task to ensure data completeness.

Note

When truncate is set to true, all data in the specified partition or table is cleared. Use this setting with caution.

Reading MaxCompute (ODPS) table data fails with error: The download session is expired.

  • Error message:

    Code:DATAX_R_ODPS_005:Failed to read ODPS data, Solution:[Please contact the ODPS administrator]. RequestId=202012091137444331f60b08cda1d9, ErrorCode=StatusConflict, ErrorMessage=The download session is expired.

  • Possible cause:

    When batch synchronization reads MaxCompute data, it uses the MaxCompute tunnel command to upload and download data. A Tunnel session has a server-side lifetime of 24 hours. Therefore, if a batch synchronization task runs for more than 24 hours, it fails. For more information about tunnel, see Tunnel overview.

  • Solution:

    Increase the concurrency of the batch synchronization task and plan the data volume appropriately to ensure that the task completes within 24 hours.

Writing to MaxCompute (ODPS) fails with block error: Error writing request body to server

  • Error message:

    Code:[OdpsWriter-09], Description:[Failed to write data to the ODPS destination table.]. - Failed to write block:0 to the ODPS destination table, uploadId=[202012081517026537dc0b0160354b]. Please contact the ODPS administrator for assistance. - java.io.IOException: Error writing request body to server。

  • Possible cause:

    • Possible cause 1: Data type exception, meaning the source data does not conform to the ODPS data type specifications. For example, writing the value 4.2223 to a decimal(18,10) data type in ODPS.

    • Possible cause 2: ODPS block or communication exception.

  • Solution:

    Convert the data types and use data that conforms to the data type specifications.

MySQL

How to synchronize sharded MySQL tables to a single MaxCompute table

You can refer to the following document for configuration: Synchronize sharded MySQL tables to MaxCompute.

How do I handle garbled Chinese characters when synchronizing to a MySQL table with utf8mb4 character set?

Add the data source using a connection string. We recommend modifying the JDBC URL to: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45. For more information, see Add a MySQL data source.

Writing to/reading from MySQL fails with error: Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout' on the server.

  • Error cause:

    • net_read_timeout: DataX splits MySQL data into multiple equal-sized SELECT statements based on the SplitPk. During execution, one of the SQL statements exceeds the maximum allowed runtime on the RDS side.

    • net_write_timeout: The timeout for waiting to send a block to the client is set too small.

  • Solution:

    Add the parameter to the data source URL connection, set net_write_timeout/net_read_timeout to a larger value, or adjust the parameter in the RDS console.

  • Improvement suggestion:

    If the task can be rerun, set the task to automatically rerun on error.

Data source parameter settings

For example: jdbc:mysql://192.168.1.1:3306/lizi?useUnicode=true&characterEncoding=UTF8&net_write_timeout=72000

Batch synchronization to MySQL fails with error: [DBUtilErrorCode-05]ErrorMessage: Code:[DBUtilErrorCode-05]Description:[Failed to write data to the configured destination table.]. - com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed

Error cause:

The MySQL parameter wait_timeout defaults to 8 hours. If data is still being fetched when this timeout is reached, the synchronization task is interrupted.

Solution:

Modify the MySQL configuration file my.cnf (or my.ini on Windows). Add the parameter under the MySQL module (in seconds): wait_timeout=2592000 interactive_timeout=2592000. Then restart and log in to MySQL, and run the following statement to verify: show variables like '%wait_time%'.

Reading MySQL database fails with error: The last packet successfully received from the server was 902,138 milliseconds ago

Normal CPU usage but high memory usage may cause the connection to be dropped.

If you confirm that the task can be automatically rerun, we recommend that you enable Auto Rerun on Error. For more information, see Configure auto rerun.

PostgreSQL

Reading PostgreSQL data fails with error: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Scenario: When the batch synchronization tool synchronizes PostgreSQL data, the following error occurs: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Possible cause: This error occurs because it takes too long to pull data from the database. Increase the values of max_standby_archive_delay and max_standby_streaming_delay. For more information, see Standby Server Events.

RDS

Batch synchronization fails when the source is Amazon RDS with error: Host is blocked

When connecting to Amazon RDS and receiving the Host is blocked error, disable the Amazon load balancer health check. After disabling it, the block issue will no longer occur.

MongoDB

Error when adding a MongoDB data source with the root user

The error occurs because when adding a MongoDB data source, you must use a user created in the database that contains the tables to synchronize. You cannot use the root user.

For example, if you want to import the name table, and the name table is in the test database, the database name should be test, and you need to use the username of a user created in the test database.

How do I use a timestamp in the query parameter to implement incremental synchronization when reading MongoDB?

You can use an assignment node to first convert a date-type value to a timestamp, and then pass that value as an input parameter for the MongoDB data synchronization task. For more information, see How to implement incremental synchronization for MongoDB timestamp type columns?

The timezone is shifted by 8 hours after synchronizing MongoDB to a destination data source. How do I handle this?

You need to set the timezone in the MongoDB Reader configuration. For more information, see MongoDB Reader.

Records updated in the source during MongoDB data reading are not synchronized to the destination. How do I handle this?

You can restart the task after a delay without changing the query conditions, that is, delay the task execution time while keeping the configuration unchanged.

Is MongoDB Reader case-sensitive?

When reading data, the Column.name configured by the user is case-sensitive. An incorrect configuration causes the read data to be null. For example:

  • MongoDB source data:

    {
        "MY_NAME": "zhangsan"
    }
  • Column configuration of the synchronization task:

    {
        "column":
        [
            {
                "name": "my_name"
            }
        ]
    }

Because the case of the column configuration does not match the source data, data reading fails.

How do I configure the MongoDB Reader timeout?

The timeout configuration parameter is cursorTimeoutInMs, which defaults to 600000 ms (10 minutes). This parameter specifies the total time that MongoDB Server spends executing the query, excluding data transfer time. If the full data read is large, the following error may occur: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

Reading MongoDB fails with error: no master

Currently, DataWorks synchronization tasks do not support reading data from a secondary node. If you configure a secondary node for reading, the following error occurs: no master.

Reading MongoDB fails with error: MongoExecutionTimeoutException: operation exceeded time limit

  • Root cause analysis:

    Caused by cursor timeout.

  • Solution:

    Increase the value of the cursorTimeoutInMs parameter.

Batch synchronization reading from MongoDB fails with error: DataXException: operation exceeded time limit

Increase the task concurrency and the read BatchSize.

MongoDB synchronization task fails with error: no such cmd splitVector

  • Possible cause:

    By default, the synchronization task uses the splitVector command for task sharding. Some MongoDB versions do not support the splitVector command, which causes the no such cmd splitVector error.

  • Solution:

    1. Go to the synchronization task configuration page and click the Convert to Script Convert to Script button at the top. Change the task to script mode.

    2. In the MongoDB parameter configuration, add the following parameter:

      "useSplitVector" : false

      This avoids using splitVector.

MongoDB batch synchronization fails with error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2"

  • Error message:

    In the synchronization task, taking wizard mode as an example, this issue may occur when Write Mode (Overwrite) is set to Yes and a non-_id column is configured as the Business Key.Write mode error

  • Possible cause:

    The data being written contains records where the _id does not match the configured Business Key (such as my_id in the example above).

  • Solution:

    • Option 1: Modify the batch synchronization task to ensure that the configured Business Key is the same as _id.

    • Option 2: Use _id as the business primary key during data synchronization.

Redis

Writing to Redis in hash mode fails with error: Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!

  • Cause:

    When Redis uses hash mode for storage, hash attributes and values must appear in pairs. For example: odpsReader: "column":[ "id", "name", "age", "address" ]. In the destination, if RedisWriter is configured as: "keyIndexes":[ 0, 1], then in Redis, id and name serve as the key, age serves as the attribute, and address serves as the value in the hash type. If only two columns are configured on the ODPS source, hash mode cannot be used for Redis storage, and this exception is thrown.

  • Solution:

    If you only want to use two columns, configure the Redis String mode for storage. If you must use hash mode, configure at least three columns on the source side.

OSS

How do I handle dirty data when reading CSV files with multi-character delimiters?

  • Symptom:

    When configuring a batch synchronization task to read data from file storage such as OSS or FTP, if the file is in CSV format and uses multiple characters as the column delimiter (such as |,, ##, or ;;), the task may fail with a dirty data error. In the runtime log, you will see an IndexOutOfBoundsException error along with dirty data.

  • Root cause analysis:

    The built-in csv reader ("fileFormat": "csv") in DataWorks has limitations when processing multi-character delimiters, which causes inaccurate column splitting for data rows.

  • Solution:

    • Wizard mode: Switch the text type to text and explicitly specify the multi-character delimiter.

    • Script mode: Change "fileFormat": "csv" to "fileFormat": "text" and correctly set the delimiter: "fieldDelimiter":"<multi-char delimiter>", "fieldDelimiterOrigin":"<multi-char delimiter>".

Is there a file count limit when reading OSS files?

Batch synchronization itself does not limit the number of files read by the OSS Reader plugin. The main limitation comes from the CU resources consumed by the task. Reading too many files at once can easily cause out-of-memory errors. Therefore, we do not recommend configuring the object parameter as: *, to prevent OutOfMemoryError: Java heap space errors.

How do I remove random strings from file names when writing to OSS?

OSS Writer writes file names by simulating directories using object names. OSS has restrictions on object names. When using "object": "datax", the written objects start with datax, with random string suffixes appended. The number of files is determined by the actual number of split tasks.

If you do not need random UUID suffixes, configure "writeSingleObject" : "true". For more information, see the writeSingleObject parameter description in the OSS Writer documentation.

Reading OSS data fails with error: AccessDenied The bucket you access does not belong to you.

  • Cause:

    The AccessKey configured for the data source does not have permissions on the bucket.

  • Solution:

    Grant read permissions on the bucket to the AccessKey account configured for the OSS data source.

Hive

Batch synchronization to local Hive fails with error: Could not get block locations.

  • Root cause analysis:

    The mapred.task.timeout parameter may be set too low, causing Hadoop to terminate the task and clean up the temporary directory, making the temporary data unavailable.

  • Solution:

    In the data source section of the batch synchronization task, if Hive read methods is set to Read Data Based on Hive JDBC (Supports Conditional Filtering), set the mapred.task.timeout parameter value in Session Configuration, for example, mapred.task.timeout=600000.

DataHub

How do I handle write failures when the data volume in a single write to DataHub exceeds the limit?

  • Error message:

    ERROR JobContainer - Exception when job runcom.alibaba.datax.common.exception.DataXException: Code:[DatahubWriter-04], Description:[Failed to write data.]. - com.aliyun.datahub.exception.DatahubServiceException: Record count 12498 exceed max limit 10000 (Status Code: 413; Error Code: TooLargePayload; Request ID: 20201201004200a945df0bf8e11a42)

  • Possible cause:

    The error occurs because the data volume submitted by DataX to DataHub in a single batch exceeds the DataHub limit. The main configuration parameters that affect the data volume submitted to DataHub are:

    • maxCommitSize: Specifies the accumulated buffer data size. When the accumulated data reaches the maxCommitSize (in MB), it is submitted to the destination in a batch. The default is 1 MB (1,048,576 bytes).

    • batchSize: Specifies the accumulated buffer data record count for DataX-On-Flume. When the accumulated record count reaches the batchSize, data is submitted to the destination in a batch.

  • Solution:

    Reduce the values of the maxCommitSize and batchSize parameters.

LogHub

A column has data in LogHub but is empty after synchronization

This plugin is case-sensitive for column names. Check the column configuration of the LogHub Reader.

Missing data when reading from LogHub

Data Integration uses the time when data enters LogHub. Check the LogHub console to verify whether the metadata column receive_time falls within the time range configured for the task.

Columns read during LogHub column mapping do not match expectations

If this occurs, manually edit the column configuration in the UI.

Why does the __time__ value read fall outside the configured time range, or why does the record count from the console for the same time range differ from the synchronization task?

The start time and end time configured in the batch synchronization task are used by the Reader to call the SLS GetCursor API to locate the start and end cursors. This time is used to locate the read range based on the SLS server-side receive time. The task actually reads data within the cursor range, which is not equivalent to filtering by the output column __time__.

The output column __time__ comes from log.getTime() of each log entry, representing the log's own log time. SLS console queries typically use the query time range, query statements, and index columns for statistics, commonly based on the log time __time__. Therefore, even if the synchronization task and console use the same time values, the __time__ range or record count may differ if the two sides use different time metrics.

Common scenarios:

  1. When log collection or delivery is delayed, historical logs are backfilled, or client clocks are inaccurate, the log time __time__ may be earlier or later than the SLS server-side receive time. The synchronization task locates cursors based on the server-side receive time, while the console queries based on __time__, which may yield different results.

  2. When data is written to another LogStore through SLS data transformation, if the transformation statement does not explicitly set __time__, the target log's __time__ typically retains the source log time rather than the transformation execution time. In this case, the synchronization task may read this batch of data within the time range when the transformation writes to the target LogStore. However, when querying the target LogStore console by the transformation execution time or current time range, these logs may not be found. You need to query by the actual __time__ range of the logs.

  3. When the console query statement, index columns, time range, and the rule filtering statement (SPL) in the synchronization task are inconsistent, the record counts may differ even if the time metrics are the same.

Troubleshooting suggestions:

  1. Verify whether the console query time range, query statement, index columns, and the start/end time and rule filtering statement (SPL) in the synchronization task are consistent.

  2. Include both __time__ (log time) and __tag__:__receive_time__ (the observable field for SLS server-side receive time, which requires this field to exist in the log tags) in the column configuration to compare the log time with the server-side receive time.

  3. If the data comes from SLS data transformation, check whether the transformation statement explicitly sets __time__, and adjust the console query time range in the target LogStore based on the actual __time__.

  4. If strict reconciliation by log time is required downstream, filter or aggregate by __time__ after writing to the destination.

Example: The source log's __time__ is 2026-06-01 10:00:00. An SLS data transformation task writes this log to the target LogStore at 2026-06-12 10:00:00 without explicitly modifying __time__. The target log's __time__ remains 2026-06-01 10:00:00. If the synchronization task's start and end times cover 2026-06-12 10:00:00, the task may read this log. However, when querying the target LogStore in the console around 2026-06-12 10:00:00 with __time__ as the filter, this log may not be found. In this case, adjust the console query time to around 2026-06-01 10:00:00, or explicitly set the target log's __time__ during data transformation as needed.

Why does a column have a value in the LogHub console query but is empty after synchronization?

The Reader matches column names from the actual pulled log content fields, Reader built-in meta-field mappings, and LogTag based on the column configuration. Column names are case-sensitive. If no match is found, null is output without an error.

Common causes include:

  1. The column name configured in column has a different case from the original log field key.

  2. The console displays aliases from query analysis, index fields, or JSON-expanded fields, which differ from the original log key that the Reader actually retrieves.

  3. The column actually comes from LogTag and needs to be configured as __tag__:<tagKey>.

  4. After configuring a rule filtering statement (SPL) or transformation, the output field names do not fully match the column configuration.

When troubleshooting, first check the source table columns and data preview on the visual page to confirm the fields that the Reader actually identifies. In script mode, you can also temporarily set column to ["*"] to see the actual log content field keys retrieved by the Reader, and then configure column based on the original keys.

Lindorm

When using Lindorm bulk mode to write data, is the historical data replaced every time?

The behavior is the same as the API write logic: data in the same row and same column is overwritten, and other data remains unchanged.

Elasticsearch

How do I query all columns in an ES index?

Retrieve the ES index mapping using the curl command and extract all columns from the mapping.

  • Shell command for querying:

    //es7
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_mapping'
    //es6
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_mapping'
  • Retrieving columns from the result:

    {
        "indexname": {
            "mappings": {
                "typename": {
                    "properties": {
                        "field1": {
                            "type": "text"
                        },
                        "field2": {
                            "type": "long"
                        },
                        "field3": {
                            "type": "double"
                        }
                    }
                }
            }
        }
    }

    The columns and attribute definitions under properties in the response are all the columns of the index. For example, the index above contains three columns: field1, field2, and field3.

How do I configure the index name when synchronizing data from ES to other data sources with different daily index names?

You can add date scheduling parameters to the index configuration to automatically calculate the index string based on different dates, enabling automatic changes to the Elasticsearch Reader index name. The configuration involves three steps: defining date parameters, configuring index parameters, and deploying and executing the task.

  1. Define date parameters: In the schedule settings of the synchronization task, add parameters to define date parameters. The following var1 configuration represents the task execution time (current day), and var2 represents the business date (previous day). Define date parameters

  2. Configure index parameters: Switch the task to script mode and configure the Elasticsearch Reader index using the format: ${variable_name}, as shown below.Configure index parameters

  3. Deploy and execute the task: After verification, submit and deploy the task to Operation Center, and run it as a periodic schedule or backfill data task.

    1. Click the Running with Parameters button to directly run the task for verification. Running with parameters replaces the scheduling system parameters used in the task configuration. After execution, check the logs to verify whether the synchronized index meets expectations.

      Note

      When running with parameters, enter the parameter values directly for replacement testing.

      RunRun

    2. If the previous step verifies as expected, the task configuration is complete. Click Save and then Commit to submit the synchronization task to the production environment.Submit task

      For a standard mode workspace, click Deploy to go to Deployment Center to deploy the synchronization task to the production environment.Deploy

  4. Result: The following shows the configuration and actual runtime index result.

    Script index configuration: "index": "esstress_1_${var1}_${var2}".

    Runtime index resolved to: esstress_1_20230106_20230105.

    Run results

How does Elasticsearch Reader synchronize properties of Object or Nested fields? (For example, synchronize object.field1)

To synchronize object field properties, you can use only script mode. In script mode, configure multi as follows, and specify column using the attribute.sub-attribute format.

"multi":{
   "multi":true 
 }

Refer to the following example for configuration:

#Example:
##Data in Elasticsearch
"hits": [
    {
        "_index": "mutiltest_1",
        "_type": "_doc",
        "_id": "7XAOOoMB4GR_1Dmrrust",
        "_score": 1.0,
        "_source": {
            "level1": {
                "level2": [
                    {
                        "level3": "testlevel3_1"
                    },
                    {
                        "level3": "testlevel3_2"
                    }
                ]
            }
        }
    }
]
##Reader configuration
"parameter": {
  "column": [
      "level1",
      "level1.level2",
      "level1.level2[0]"
  ],
  "multi":{
        "multi":true
    }
}
##Writer result: 1 row with 3 columns, column order matches reader configuration
COLUMN              VALUE
level1:             {"level2":[{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]}
level1.level2:      [{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]
level1.level2[0]:   {"level3":"testlevel3_1"}

After synchronizing string-type data from ODPS to ES, the quotes appear missing on both sides. How do I handle this? Can a JSON-type string from the source be synchronized as an ES NESTED object?

  1. The extra double quotes displayed before and after characters are a display issue in Kibana. The actual data does not have these leading and trailing double quotes. Use the curl command or Postman to view the actual data. The curl command to retrieve data is as follows:

    //es7
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_mapping'
    //es6
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_mapping'

    Results

  2. You can configure the ES write column type as nested to synchronize JSON-type string data from ODPS to ES in nested format. The following example synchronizes the name column to ES in nested format.

    • Synchronization configuration: Configure the type of name as nested.Synchronization configuration

    • Synchronization result: name is a nested object type.Synchronization result

The source data is string "[1,2,3,4,5]". How do I synchronize it to ES as an array?

There are two configuration methods for writing array types to ES. Choose the corresponding synchronization method based on the source data format.

  • Write to ES as an array type by parsing the source data as JSON. For example, if the source data is "[1,2,3,4,5]", configure json_array=true to parse the source data and write it to the ES column as an array. Configure the ColumnList with json_array=true.

    • Wizard mode configuration:Wizard mode configuration

    • Script mode configuration:

      "column":[
        {
          "name":"docs",
          "type":"keyword",
          "json_array":true
        }
      ]
  • Write to ES as an array type by parsing the source data with a delimiter. For example, if the source data is "1,2,3,4,5", configure a delimiter splitter="," to parse and write the data to the ES column as an array.

    • Limitations:

      • A task supports only one delimiter. The splitter is globally unique and does not support different delimiters for different array columns. For example, for source columns col1="1,2,3,4,5" , col2="6-7-8-9-10", the splitter cannot be configured separately for each column.

      • The splitter can be configured as a regular expression. For example, if the source column value is "6-,-7-,-8+,*9-,-10", you can configure splitter:".,." and this is supported in wizard mode.

    • Wizard mode configuration:Script mode configurationsplitter: defaults to "-,-"

    • Script mode configuration:

      "parameter" : {
            "column": [
              {
                "name": "col1",
                "array": true,
                "type": "long"
              }
            ],
            "splitter":","
      }

When writing data to ES, an unauthenticated request is made first, but authentication is still required, causing the request to fail. As a result, all submitted request data is logged, generating a large volume of audit logs daily. How do I handle this?

  • Root cause analysis:

    HttpClient prescribes that each time a connection is established, an unauthenticated request is made first. After the server returns the authentication requirement (specifying the authentication method based on the response), an authenticated request is then made. Since each ES data write requires establishing a connection, every data write generates one unauthenticated request, which is then recorded in the audit logs.

  • Solution:

    Add the "preemptiveAuth":true configuration in script mode.

How do I synchronize data to ES as a Date type?

There are two methods to configure date writing. Choose the appropriate one based on your needs.

  • Write directly to the ES Date column based on the content read from the Reader:

    • Configure origin:true to write the read content directly to ES.

    • Configure "format" to specify the format attribute for the column when creating the mapping through ES write.

      "parameter" : {
          "column": [
              {
                  "name": "col_date",
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss",
                  "origin": true
              }
                ]
      }
  • Timezone conversion: If you need Data Integration to perform timezone conversion, add the Timezone parameter.

    "parameter" : {
        "column": [
            {
                "name": "col_date",
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss",
                "Timezone": "UTC"
            }
              ]
    }

Elasticsearch Writer fails when specifying an external version. How do I handle this?

  • The type:version is configured, but ES does not support specifying an external version.

        "column":[
                                {
                                    "name":"id",
                                    "type":"version"
                                },
      ]
  • Solution:

    Remove the "type":"version" configuration. Elasticsearch Writer does not support external version specification.

Batch synchronization reading from Elasticsearch fails with error: ERROR ESReaderUtil - ES_MISSING_DATE_FORMAT, Unknown date value. please add "dataFormat". sample value:

  • Root cause analysis:

    Elasticsearch Reader cannot parse the date format of a date-type column because the mapping for the corresponding ES date column does not have a format configured.

  • Solution:

    • Configure the dateFormat parameter with the same format as the ES date column's format, using "||" as the separator. The format must include all date type formats. For example:

      "parameter" : {
            "column": [
           			"dateCol1",
              	"dateCol2",
                "otherCol"
            ],
           "dateFormat" : "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss",
      }
    • Set the mapping format for all date columns in the ES database.

Batch synchronization reading from Elasticsearch fails with error: com.alibaba.datax.common.exception.DataXException: Code:[Common-00].

  • Root cause analysis:

    Due to fastjson keyword limitations, the index or columns may contain keywords such as $ref.

  • Solution:

    Elasticsearch Reader does not support synchronizing indexes that contain the $ref keyword in column names. For more information, see Elasticsearch Reader.

Batch synchronization writing to Elasticsearch fails with error: version_conflict_engine_exception.

  • Root cause analysis:

    This triggered ES's optimistic locking mechanism. The current version number should be one value, but the version number passed by the update command is different, causing a version conflict. During the update, someone else was deleting index data.

  • Solution:

    1. Check whether data deletion operations are occurring.

    2. Change the task synchronization method from Update to Index.

Batch synchronization writing to Elasticsearch fails with error: illegal_argument_exception.

  • Root cause analysis:

    When configuring advanced attributes such as similarity and properties for a column, other_params is required for the plugin to recognize them.Cause

  • Solution:

    Configure other_params in the column configuration, and add similarity inside other_params, as follows:

    {"name":"dim2_name",...,"other_params":{"similarity":"len_similarity"}}

Batch synchronization of ODPS Array column data to Elasticsearch fails with error: dense_vector

  • Root cause analysis:

    Currently, batch synchronization writing to Elasticsearch does not support the dense_vector type. Only the following types are supported:

    ID,PARENT,ROUTING,VERSION,STRING,TEXT,KEYWORD,LONG,
    INTEGER,SHORT,BYTE,DOUBLE,FLOAT,DATE,BOOLEAN,BINARY,
    INTEGER_RANGE,FLOAT_RANGE,LONG_RANGE,DOUBLE_RANGE,DATE_RANGE,
    GEO_POINT,GEO_SHAPE,IP,IP_RANGE,COMPLETION,TOKEN_COUNT,OBJECT,NESTED;
  • Solution:

    For types not supported by Elasticsearch Writer, handle them as follows:

    • We do not recommend using Elasticsearch Writer to create index mappings. Use a custom mapping instead.

    • Change the corresponding type to NESTED.

    • Modify the configuration to: dynamic = true, cleanup=false.

Why doesn't the Settings configuration take effect when Elasticsearch Writer creates an index?

  • Cause:

    #Incorrect configuration
    "settings": {
      "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }
    #Correct configuration
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  • Solution:

    Settings take effect only when an index is created, which includes two cases: the index does not exist, or cleanup=true. When cleanup=true, the Settings configuration does not need to include "index".

In a custom index, the nested attribute type is keyword, but why does the type become keyword after automatic generation? (Automatic generation refers to executing a synchronization task with cleanup=true)

#Original mappings
{
  "name":"box_label_ret",
  "properties":{
    "box_id":{
      "type":"keyword"
    }
}
#After rebuilding with cleanup=true, it becomes
{
    "box_label_ret": {
      "properties": {
        "box_id": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }}}}
}
  • Root cause analysis:

    For nested types, Elasticsearch Writer uses only the top-level mappings and lets ES auto-adapt the nested complex types. The attribute type changing to text with an added fields:keyword is ES's auto-adaptation behavior and does not affect ES usage. If you require a specific mapping format, see Elasticsearch Writer.

  • Solution:

    Create the expected ES index mappings before synchronization, then set cleanup to false in the ES synchronization task and execute the task.

Kafka

The endDateTime is configured to specify the cutoff range for data to synchronize from Kafka, but data beyond this time is found in the destination data source

Kafka Reader reads data in batches. In a batch of data, if any records exceed the endDateTime, synchronization stops. However, the data beyond the endDateTime in that batch is still written to the destination data source.

  • You can also use the skipExceedRecord configuration to specify whether to synchronize the exceeded data. For detailed usage, see Kafka Reader. [Setting this to not synchronize is not recommended as it may cause data loss.]

  • You can configure Kafka's max.poll.records parameter to specify the amount of data pulled in each batch. Combined with the concurrency, you can control the amount of data that may exceed the limit. The exceeded data volume < max.poll.records × concurrency.

Why does the task keep running without reading data or finishing when there is little data in Kafka?

  • Root cause analysis:

    When the data volume is small or data is unevenly distributed, some Kafka partitions may have no new data entering or the new data may not reach the specified end offset. Since the task exit condition requires all partitions to reach the specified end offset, these "idle" partitions cannot satisfy the condition, blocking the entire task from completing normally.

  • Solution:

    In this case, set the synchronization end policy to: 1 minute without reading new data (in script mode, set the parameter stopWhenPollEmpty to true and stopWhenReachEndOffset to true) to allow the task to exit directly after reading the latest offset data from all partitions, avoiding idle running. However, note that if records with timestamps earlier than the task's configured end offset are written to a partition after the task ends, those records will not be consumed.

RestAPI

RestAPI Writer fails with error: The JSON string found via path:[] is not in array format

RestAPI Writer provides two write modes. When synchronizing multiple records, set dataMode to multiData. For more information, see RestAPI Writer. You also need to add the parameter dataPath:"data.list" in the RestAPI Writer script.Parameters

Important

When configuring columns, do not add the "data.list" prefix.

OTS Writer configuration

How do I configure OTS Writer when writing data to a destination table that contains an auto-increment primary key column?

  1. The OTS Writer configuration must include the following two requirements:

    "newVersion": "true",
    "enableAutoIncrement": "true",
  2. The auto-increment primary key column name should not be configured in OTS Writer.

  3. The number of primaryKey entries + the number of column entries configured in OTS Writer must equal the number of columns in the upstream OTS Reader data.

Time series model configuration

How to understand the _tag and is_timeseries_tag columns in the time series model configuration?

Example: A data record has three tags: [phone=xiaomi, RAM=8G, camera=LEICA].Data

  • Data export example (OTS Reader)

    • If you want to merge the above tags into a single column for export, configure it as follows:

      "column": [
            {
              "name": "_tags",
            }
          ],

      DataWorks exports the tags as a single column of data in the following format:

      ["phone=xiaomi","camera=LEICA","RAM=8G"]
    • If you want to export the phone tag and the camera tag as separate columns, configure it as follows:

      "column": [
            {
              "name": "phone",
              "is_timeseries_tag":"true",
            },
            {
              "name": "camera",
              "is_timeseries_tag":"true",
            }
          ],

      DataWorks exports two columns of data in the following format:

      xiaomi, LEICA
  • Data import example (OTS Writer)

    The upstream data source (Reader) has two columns of data:

    • One column contains: ["phone=xiaomi","camera=LEICA","RAM=8G"].

    • The other column contains: 6499.

    To add both columns to the tags, the expected tag field format after writing is as follows:FormatConfigure as follows:

    "column": [
          {
            "name": "_tags",
          },
          {
            "name": "price",
            "is_timeseries_tag":"true",
          },
        ],
    • The first column configuration imports ["phone=xiaomi","camera=LEICA","RAM=8G"] as a whole into the tag field.

    • The second column configuration imports price=6499 individually into the tag field.

Custom table name

How do I customize the table name for a batch synchronization task?

If your table names follow a regular pattern, such as orders_20170310, orders_20170311, and orders_20170312, where tables are distinguished by date and share the same structure, you can use scheduling parameters (Configure synchronization tasks in script mode) to customize the table name and automatically read the previous day's table data from the source database every early morning.

For example, if today is March 15, 2017, the system automatically imports data from the orders_20170314 table in the source database, and so on.Custom table name

In script mode, change the source table name to a variable, such as orders_${tablename}. Since tables are distinguished by date and you need to read the previous day's data daily, assign the variable value in the task parameter configuration as tablename=${yyyymmdd}.

Note

For more information about scheduling parameters, see Configure scheduling parameters

Adding columns to a table

How do I handle column additions (modifications) in the source table for batch synchronization?

Go to the synchronization task configuration page, modify the column mappings to update the changed columns in the task configuration, and then resubmit and execute the task for the changes to take effect.

Task configuration issues

How do I handle the situation where I cannot view all tables when configuring a batch synchronization node?

When configuring a batch synchronization node, the Source section displays only the first 25 tables from the selected data source by default. If there are more tables, you can enter the table name to search or use script mode for development.

Table/column name keywords

How do I handle synchronization task failures caused by keyword conflicts in table or column names?

  • Error cause: The column configuration contains reserved keywords, or the column configuration contains columns that start with a number.

  • Solution: Switch the Data Integration synchronization task to script mode and escape the special columns in the column configuration. For configuring tasks in script mode, see Configure synchronization tasks in script mode.

    • The escape character for MySQL is `keyword`.

    • The escape character for Oracle and PostgreSQL is "keyword".

    • The escape character for SQL Server is [keyword].

    MySQL example:Column conflict

  • Taking a MySQL data source as an example:

    1. Run the following statement to create a table named aliyun: create table aliyun (`table` int ,msg varchar(10));

    2. Run the following statement to create a view and assign an alias to the table column: create view v_aliyun as select `table` as col1,msg as col2 from aliyun;

      Note
      • table is a MySQL keyword. During data synchronization, the concatenated code causes an error. You need to create a view and assign an alias to the table column.

      • We do not recommend using keywords as table column names.

    3. After running the above statements, use the v_aliyun view instead of the aliyun table when configuring the synchronization task.

Column mapping

Batch synchronization task fails with error: plugin xx does not specify column

This error may occur because the column mapping of the synchronization task is not correctly configured, or the plugin does not have column properly configured.

  1. Check whether column mapping is configured.

  2. Check whether the plugin has column properly configured.

Unstructured data source: How do I handle the issue where columns cannot be mapped after clicking data preview?

  • Symptom:

    When you click Preview Data, a message similar to the following appears, indicating that the column byte size exceeds the limit.

    Issue symptoms

  • Cause: To prevent OOM, the data source service checks column length when processing data preview requests. If a single column exceeds 1000 bytes, the above message appears. This message does not affect actual task execution. You can ignore this error and run the batch synchronization task directly.

    Note

    If the file exists and connectivity is normal, the following situations can also cause data preview to fail:

    • A single row in the file exceeds the byte size limit of 10 MB. In this case, no data is displayed, similar to the above message.

    • A single row in the file exceeds the column count limit of 1000 columns. In this case, only the first 1000 columns are displayed, with a message shown at the 1001st column.

Modify TTL

Can the TTL of a synchronized table only be modified using the ALTER statement?

TTL is set at the table level. There is no TTL option in the synchronization task configuration.

Function aggregation

When synchronizing via API, does it support using source-side functions (such as MaxCompute) for aggregation? For example, the source table has columns a and b as Lindorm primary keys

API-based synchronization does not support using source-side functions. Process the data using source-side functions first before importing.