This topic provides answers to some frequently asked questions about batch synchronization.
Overview
Why is the network connectivity test of a data source successful, but the batch synchronization node that uses the data source fails to be run?
- If the data source has passed the network connectivity test, test the connectivity again to confirm that the resource group that you use can still connect to the data source and that no changes have been made to the data source.
- Check whether the resource group that is connected to the data source is the same as the resource group that you use to run the batch synchronization node. Check the resource group that is used to run a node:
- If the node is run on the shared resource group for Data Integration, the log contains the following information:
running in Pipeline[basecommon_ group_xxxxxxxxx]
- If the node is run on a custom resource group for Data Integration, the log contains the following information:
running in Pipeline[basecommon_xxxxxxxxx]
- If the node is run on an exclusive resource group for Data Integration, the log contains the following information:
running in Pipeline[basecommon_S_res_group_xxx]
- If a node that is scheduled to run in the early morning occasionally fails but succeeds when it is rerun, check the load on the data source at the time when the failure occurred.
How do I change the resource group that is used to run a batch synchronization node in Data Integration?
You can go to DataStudio or Operation Center to change the resource group that is used to run a batch synchronization node. For more information, see Create and use an exclusive resource group for Data Integration.
How do I locate and handle dirty data?
Definition: If an exception occurs when a single data record is written to the destination, the data record is considered dirty data. Therefore, any data record that fails to be written to the destination is considered dirty data.
- Dirty data is allowed for a batch synchronization node: If a dirty data record is generated, the batch synchronization node continues to run. However, the dirty data record is discarded and is not written to the destination.
- The maximum number of dirty data records that are allowed is specified for a batch synchronization node:
- If you set the maximum number of dirty data records that are allowed to 0, the batch synchronization node fails and exits when a dirty data record is generated.
- If you set the maximum number of dirty data records that are allowed to x, the batch synchronization node fails and exits when the number of dirty data records that are generated exceeds x. The batch synchronization node continues to run if the number of dirty data records that are generated does not exceed x. However, the dirty data records are discarded and are not written to the destination. A configuration sketch of this threshold is provided at the end of this answer.
- Scenario 1:
- Problem description: The following error message is returned:
{"message":"Dirty data is found in the data that is to be written to the destination MaxCompute table: The [third] field contains dirty data. Check and correct the data, or increase the threshold value for the number of data dirty records and ignore this dirty data record.","record":[{"byteSize":0,"index":0,"type":"DATE"},{"byteSize":0,"index":1,"type":"DATE"},{"byteSize":1,"index":2,"rawData":0,"type":"LONG"},{"byteSize":0,"index":3,"type":"STRING"},{"byteSize":1,"index":4,"rawData":0,"type":"LONG"},{"byteSize":0,"index":5,"type":"STRING"},{"byteSize":0,"index":6,"type":"STRING"}
- Solutions: The logs show that the third field contains dirty data. You can identify the cause of dirty data based on the following scenarios:
- If the error for the dirty data is reported by a Writer plug-in, check the table creation statement that is used by the Writer plug-in. Make sure that the size defined for the specified field in the destination MaxCompute table is not smaller than the size of the data in the same field in the source MySQL table.
- The following requirements must be met before you can synchronize data from the source to the destination: 1. The data type of source columns must match the data type of destination columns. For example, data of the VARCHAR type in source columns cannot be written to the destination columns that are used to store data of the INT type. 2. The size of data defined by the data type of destination columns must be sufficient to receive the data in the mapping columns in the source. For example, you can write data of the LONG, VARCHAR, or DOUBLE type from the source to the columns that are used to store data of a string or text type.
- If the dirty data error is not clear, copy and print out dirty data records, observe the data, and then compare the data type of the records with the data type of destination columns to identify dirty data records.
Sample dirty data record: {"byteSize":28,"index":25,"rawData":"ohOM71vdGKqXOqtmtriUs5QqJsf4","type":"STRING"}
In this record, byteSize indicates the number of bytes, index 25 indicates the 26th field, rawData indicates the specific value, and type indicates the data type.
- Scenario 2:
- Problem description: An error for dirty data is reported when data that is read from a MySQL data source by DataX contains null values.
- Solution: Check the data type of the destination fields to which the source fields that contain null values are written. If the data type of source fields is different from the data type of destination fields, an error is reported. For example, if STRING-type source fields that contain null values are written to the INT-type destination fields, an error is reported.
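For reference, the following snippet is a minimal sketch of how the dirty data threshold can be specified in the script-mode (code editor) settings of a batch synchronization node. The value 0 is only an example, and the surrounding structure is an assumption that you should check against the script that DataWorks generates for your node.
"setting": {
    "errorLimit": {
        "record": "0"
    }
}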
How do I view dirty data?
You can view the dirty data records that are generated during data synchronization in the run logs of the batch synchronization node.
Is all data not synchronized if the number of dirty data records generated during data synchronization exceeds the specified upper limit?
If the number of dirty data records generated during data synchronization exceeds the specified upper limit, the batch synchronization node stops running. The data that is synchronized to the destination before the node stops running is retained. If you set the upper limit to N, a maximum of N dirty data records are allowed during data synchronization; if the upper limit is exceeded, the node stops running. For example, if you set the upper limit to 0, the node stops running when the first dirty data record is detected. In this case, a small number of data records may already have been synchronized to the destination before the node stops. If the first data record is a dirty data record, no data records are synchronized.
What do I do if a batch synchronization node runs for an extended period of time?
Possible cause 1: The node is slowed down by the SQL statements that are executed before or after data synchronization or by an inappropriate shard key configuration.
- The SQL statement that you configure to execute before or after data synchronization takes an extended period of time to execute in the database. This prolongs the period that is required to run the batch synchronization node.
- The shard key is not appropriately configured. As a result, the batch synchronization node runs at a slow speed.
A batch synchronization node uses the shard key that you configure to shard data. Then, Data Integration uses parallel threads to synchronize data based on the shard key. This improves the efficiency of data synchronization. For information about whether a shard key is required by a plug-in, see the related topic for the plug-in.
Solution 1: Optimize the SQL statements and the shard key configuration. A shard key configuration sketch is provided after this list.
- When you configure the SQL statement to be executed before or after data synchronization, we recommend that you use indexed fields to filter data.
- If the related plug-in requires a shard key, we recommend that you appropriately configure a shard key. For example, when you configure a shard key for MySQL Reader, take note of the following items:
- We recommend that you specify the name of the primary key column of a source table as a shard key. This way, data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards.
- A shard key can be used to shard data only of an integer data type. If you use a shard key to shard data of an unsupported data type, only a single thread is used to read data.
- If no shard key is specified, only a single thread is used to read data.
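For reference, the following snippet is a minimal sketch of a shard key configuration for MySQL Reader in the code editor. The data source, table, and column names are hypothetical; the splitPk parameter specifies the shard key, and the primary key column id is used as the shard key in this sketch.
"parameter": {
    "datasource": "my_mysql_source",
    "column": ["id", "gmt_create", "amount"],
    "splitPk": "id",
    "connection": [
        {
            "datasource": "my_mysql_source",
            "table": ["orders"]
        }
    ]
}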
Possible cause 2: The batch synchronization node waits for resources.
Solution 2: If the logs show that the node waits for resources for an extended period of time, the number of parallel threads on the related exclusive resource group for Data Integration reaches the upper limit. For more information about the specific causes and solutions, see Why does a data synchronization node in Data Integration wait for resources for an extended period of time?.
How do I handle a dirty data error that is caused by encoding format configuration issues or garbled characters?
- Problem description:
If data contains emoticons, a dirty data error message similar to the following error message may be returned during data synchronization:
[13350975-0-0-writer] ERROR StdoutPluginCollector - Dirty data {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"},}],"type":"writer"}
- Possible causes:
- utf8mb4 is not configured for a data source. As a result, an error is reported when data that contains emoticons is synchronized.
- Data in the source contains garbled characters.
- The encoding format of a data source is different from the encoding format of the client.
- The encoding format of the browser is different from the encoding format of a data source or the client. As a result, the preview fails or the previewed data contains garbled characters.
- Solutions:
The solutions vary based on the cause.
- If data in the source contains garbled characters, process the data before you run the batch synchronization node.
- If the encoding format of the data source is different from the encoding format of the client, modify the configuration for the encoding format of the data source and make sure that the encoding format of the data source is the same as the encoding format of the client.
- If the encoding format of the browser is different from the encoding format of the data source or client, modify the configuration for the encoding format of the browser and make sure that the encoding format is the same as the encoding format of the data source or client. Then, preview the data.
- If your data source is added in connection string mode and you want to add the configuration for the encoding format utf8mb4 for the data source, append the configuration to the related JDBC URL. Example:
jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45
- If your data source is added in Alibaba Cloud instance mode and you want to add the configuration for the encoding format utf8mb4 for the data source, append the configuration to the name of the related database. Example:
database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45
- Change the encoding format of the data source to utf8mb4. For example, you can change the encoding format of an ApsaraDB RDS database in the ApsaraDB RDS console. Note: You can run the following command to set the encoding format of the ApsaraDB RDS data source to utf8mb4:
set names utf8mb4
You can run the following command to view the encoding format of an ApsaraDB RDS data source:
show variables like 'char%'
What do I do if a server-side request forgery (SSRF) attack is detected in a batch synchronization node and the error message Task have SSRF attacts
is returned?
If a private IP address is specified when the data source is added to DataWorks and the data source is used for a batch synchronization node, you cannot run the node on the shared resource group for Data Integration. You must run the node on an exclusive resource group for Data Integration. For more information about exclusive resource groups for Data Integration, see Create and use an exclusive resource group for Data Integration.
What do I do if a batch synchronization node occasionally fails to be run?
- If the node is run on an exclusive resource group for Data Integration, configure the IP address whitelist based on the following instructions:
- If you have added the IP address of the elastic network interface (ENI) of the exclusive resource group for Data Integration to the IP address whitelist of the data source and the resource group is scaled out, you must add the ENI IP address of the resource group after the scale-out to the IP address whitelist.
- We recommend that you directly add the CIDR block of the vSwitch with which the exclusive resource group for Data Integration is associated to the IP address whitelist of the data source. Otherwise, each time the resource group is scaled out, you must update the ENI IP address in the IP address whitelist of the data source. For more information, see Configure a whitelist.
- If the node is run on the shared resource group for Data Integration, configure the IP address whitelist based on the following instructions:
Make sure that IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides are added to the IP address whitelist of the data source. For more information, see Add the IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides to the whitelist of a data source.
What do I do if a field is added to or updated in the source table of a batch synchronization node?
Go to the configuration tab of the batch synchronization node and update the mappings between the fields in the source table and the fields in the destination table. Then, commit and run the batch synchronization node for the update to take effect.
What do I do if an error occurs when I use the root user to add a MongoDB data source?
Change the username. You must use the username of the database to which the table you want to synchronize belongs when you add a MongoDB data source. The username of the root user is not supported.
For example, if you want to synchronize data from the name table in the test database, use the username of the test database.
The authDB database used by MongoDB is the admin database. How do I synchronize data from a business database?
Enter the name of a business database when you add a MongoDB data source to DataWorks and make sure that the user that you use has the required permissions on the business database. If the error message "auth failed" is returned when you test the network connectivity of the data source, ignore the error message. If you configure a batch synchronization node by using the code editor, add the "authDb":"admin" configuration to the JSON configurations of the node.
How do I convert the values of the variables in the query parameter into values of the TIMESTAMP data type when I synchronize incremental data from a table of a MongoDB database?
Use assignment nodes to convert data of the DATE type into data of the TIMESTAMP type and use the timestamp value as an input parameter for data synchronization from MongoDB. For more information, see How do I synchronize incremental data of the TIMESTAMP type from a table of a MongoDB database?.
After data is synchronized from a MongoDB data source to a destination, the time zone of the data is 8 hours ahead of the original time zone of the data. What do I do?
Check the configurations of the cursorTimeoutInMs and query parameters. For information about the parameters, see MongoDB Reader.
What do I do if a batch synchronization node fails to synchronize data changes in a MongoDB data source to a destination?
Restart the node after a period of time without changing the configuration of the query parameter.
Is the number of OSS objects from which OSS Reader can read data limited?
In Data Integration, the number of OSS objects from which OSS Reader can read data is not limited. The maximum number of OSS objects from which data can be read is determined by the JVM parameters that are configured for a batch synchronization node. To prevent out of memory (OOM) errors, we recommend that you do not use an asterisk (*) to specify the objects from which you want to read data.
What do I do if data fails to be written to DataHub because the amount of data that I want to write to DataHub at a time exceeds the upper limit?
- Problem description:
The following error message is returned when I run a batch synchronization node to write data to DataHub:
ERROR JobContainer - Exception when job runcom.alibaba.datax.common.exception.DataXException: Code:[DatahubWriter-04], Description:[Failed to write data to DataHub.]. - com.aliyun.datahub.exception.DatahubServiceException: Record count 12498 exceed max limit 10000 (Status Code: 413; Error Code: TooLargePayload; Request ID: 20201201004200a945df0bf8e11a42)
- Possible cause: The amount of data that you want to write to DataHub at a time exceeds the upper limit that is allowed by DataHub. The following parameters specify the maximum amount of data that can be written to DataHub at a time:
- maxCommitSize: specifies the maximum amount of the buffered data that Data Integration can accumulate before it transfers the data to the destination. Unit: MB. The default value is 1 MB, which is 1,048,576 bytes.
- batchSize: specifies the maximum number of the buffered data records that a single batch synchronization node can accumulate before it transfers the data records to the destination.
- Solution:
Set the maxCommitSize and batchSize parameters to smaller values.
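For reference, the following snippet is a minimal sketch of a DataHub Writer configuration in which the two parameters are set to smaller values. The data source and topic names are hypothetical, and the values are only examples; check the unit of maxCommitSize against the DataHub Writer topic before you apply them.
"parameter": {
    "datasource": "my_datahub_source",
    "topic": "my_topic",
    "maxCommitSize": 524288,
    "batchSize": 1024
}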
Is historical data replaced each time I use the bulk mode provided by Lindorm to write data to Lindorm?
Yes, data in the same column and row is replaced, and the other data remains unchanged. The logic is the same as that when an API operation is used to write data.
How do I query all fields in an index of an Elasticsearch cluster?
Use a curl command to query the mappings of an Elasticsearch index and extract all fields from the mappings.
- Sample curl commands that can be used:
//es7
curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_mapping'
//es6
curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_mapping'
- Sample result that is returned:
{ "indexname": { "mappings": { "typename": { "properties": { "field1": { "type": "text" }, "field2": { "type": "long" }, "field3": { "type": "double" } } } } } }
In the preceding result, all fields in the index and the properties of the fields are contained in the properties parameter. In this example, the index contains field1, field2, and field3.
How do I configure a batch synchronization node to synchronize data from indexes with dynamic names in an Elasticsearch data source to another data source?
When you configure indexes from which you want to read data, you can use date-related variables. This way, the system can calculate index-related strings and enable Elasticsearch Reader to read data from indexes with dynamic names. To configure and run such a batch synchronization node, perform the following steps: Define date-related variables, configure the defined variables in the code to specify indexes, and then deploy and run the node.
- Define date-related variables: When you configure scheduling settings for the batch synchronization node on the configuration tab of the node by using the codeless user interface (UI), you must assign scheduling parameters to variables to define the variables. For example, define var1 to indicate the date of the scheduling time of the batch synchronization node and var2 to indicate the data timestamp of the batch synchronization node, which is one day before the date of the scheduling time.
- Configure the defined variables in the code to specify indexes: Switch from the codeless UI to the code editor. Configure the defined variables in the code of the batch synchronization node in the ${Variable name} format to specify the indexes from which you want Elasticsearch Reader to read data.
- Deploy and run the batch synchronization node: Test the batch synchronization node, and commit and deploy the node to Operation Center if the test result meets your expectations. You can configure time properties for the batch synchronization node to enable the system to periodically run the node, or use the data backfill feature to run the node.
- Click the Run with Parameters icon in the top toolbar of the configuration tab of the batch synchronization node to test the node. After you click the Run with Parameters icon to test the node, the system replaces the scheduling parameters configured for the batch synchronization node with actual values. After the test is complete, you can view the run logs of the batch synchronization node to check whether the indexes from which data is read meet your expectations. Note You must enter values for the replacement test after you click the Run with Parameters icon.
- If the test result meets your expectations, click the Save and Submit icons in the top toolbar of the configuration tab of the batch synchronization node in sequence to commit the node to the production environment.
If your workspace is in standard mode, click Deploy to deploy the batch synchronization node to the production environment.
- View the result of the batch synchronization node.
Configuration that is used to specify indexes in the code:
"index": "esstress_1_${var1}_${var2}"
Replacement result:
esstress_1_20230106_20230105
How do I specify table names in the configurations of a batch synchronization node?
If you want to synchronize data from the tables that are named in a consistent format, such as the tables that are named by date and have the same table schemas (orders_20170310, orders_20170311, and orders_20170312), specify table names by using the scheduling parameters when you configure the batch synchronization node in the code editor. This way, the batch synchronization node automatically reads table data generated on the previous day from the source every morning. For more information about how to configure a batch synchronization node by using the code editor, see Configure a batch synchronization node by using the code editor.

In the code editor, use a variable to specify the name of a source table, such as orders_${tablename}. The tables are named by date. If you want the batch synchronization node to read data generated on the previous day from the source every day, assign the value ${yyyymmdd} to the ${tablename} variable in the parameter configurations of the node.
For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
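For reference, the following snippet is a hedged sketch of how the table name can be parameterized for MySQL Reader in the code editor. The data source, table, and column names are hypothetical; ${tablename} is the variable that you assign in the scheduling parameter settings of the node, as described above.
"parameter": {
    "datasource": "my_mysql_source",
    "column": ["order_id", "amount"],
    "connection": [
        {
            "datasource": "my_mysql_source",
            "table": ["orders_${tablename}"]
        }
    ]
}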
What do I do if the table that I want to select is not displayed when I configure a batch synchronization node by using the codeless UI?
When you configure a batch synchronization node by using the codeless UI, the system displays only the first 25 tables in the selected source by default. If the selected source contains more than 25 tables and the table that you want to select is not displayed, you can enter the name of the table in the Table drop-down list to search for the table, or switch to the code editor and configure the batch synchronization node by using the code editor.
What are the items that I must take note of when I use the Add feature to configure a batch synchronization node that synchronizes data from a MaxCompute table?
- You can enter constants. Each constant must be enclosed in single quotation marks ('), such as 'abc' and '123'.
- You can use the Add feature together with scheduling parameters, such as '${bizdate}'. For information about how to use scheduling parameters, see Supported formats of scheduling parameters.
- You can specify the partition key columns from which you want to read data, such as the partition key column pt.
- If the field that you entered cannot be parsed, the value of the Type parameter for the field is displayed as Custom.
- MaxCompute functions are not supported.
- If the value of Type for the fields that you manually added, such as the partition key columns of MaxCompute tables or fields that cannot be previewed in LogHub, is Custom, the batch synchronization node can still be run.
How do I enable a batch synchronization node to synchronize data from a partition key column in a MaxCompute table?
Click Add when you configure mappings between source fields and destination fields, and specify the name of the partition key column, such as pt.
How do I enable a batch synchronization node to synchronize data from multiple partitions in a MaxCompute table?
- You can use Linux Shell wildcards to specify the partitions. An asterisk (*) indicates zero or multiple characters, and a question mark (?) indicates a single character.
- The partitions that you specify must exist in the source table. Otherwise, the system reports an error for the batch synchronization node. If you want the batch synchronization node to be successfully run even if the specified partitions do not exist in the source table, use the code editor to modify the code of the node and add "successOnNoPartition": true to the configuration of MaxCompute Reader.
- To read data from the partition pt=1,ds=hangzhou, specify "partition":"pt=1,ds=hangzhou".
- To read data from all the pt=1 partitions, specify "partition":"pt=1,ds=*".
- To read data from all the partitions in the test table, specify "partition":"pt=*,ds=*".
- To read data from the partition that stores the largest amount of data, add /*query*/ ds=(select MAX(ds) from DataXODPSReaderPPR) to the configuration of MaxCompute Reader.
- To filter data based on filter conditions, add /*query*/ pt+Expression to the configuration of MaxCompute Reader. For example, /*query*/ pt>=20170101 and pt<20170110 indicates that you want to read the data that is generated from January 1, 2017 to January 9, 2017 from all the pt partitions in the test table.
MaxCompute Reader processes the content that follows /*query*/ as a WHERE clause.
What do I do if a batch synchronization node fails to be run because the name of a column in the source table is a keyword?
- Cause: The value of the column parameter contains reserved fields or fields whose names start with a number.
- Solution: Use the code editor to configure the batch synchronization node and escape special fields in the configuration of the column parameter. For information about how to configure a batch synchronization node in the code editor, see Configure a batch synchronization node by using the code editor.
- MySQL uses grave accents (`) as escape characters to escape keywords in the following format: `Keyword`
- Oracle and PostgreSQL use double quotation marks (") as escape characters to escape keywords in the following format: "Keyword"
- SQL Server uses brackets ([]) as escape characters to escape keywords in the following format: [Keyword]
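For example, the following snippet is a minimal sketch of a column configuration for MySQL Reader that escapes a keyword column. The column names come from the aliyun table that is used in the example below; treat the exact column list as illustrative.
"column": [
    "`table`",
    "msg"
]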
- In this example, a MySQL data source is used.
- Execute the following statement to create a table named aliyun, which contains a column named table:
create table aliyun (`table` int ,msg varchar(10));
- Execute the following statement to create a view and assign an alias to the table column:
create view v_aliyun as select `table` as col1,msg as col2 from aliyun;
Note:
- MySQL uses table as a keyword. If the name of a column in the source table is table, an error is reported during data synchronization. In this case, you must create a view to assign an alias to the table column.
- We recommend that you do not use a keyword as the name of a column.
- Execute the preceding statement to assign an alias to the column whose name is a keyword. When you configure a batch synchronization node for the MySQL data source, use the v_aliyun view to replace the aliyun table.
Why is no data obtained for some fields when I run a batch synchronization node to synchronize data from a LogHub data source, even though the fields contain values?
Field names are case-sensitive for LogHub Reader. Check the field name configuration for LogHub Reader.
Why is some data missing when I run a batch synchronization node to read data from a LogHub data source?
In Data Integration, a batch synchronization node reads data from a LogHub data source based on the time when the data arrives in LogHub. Log on to the LogHub console and check whether the value of the metadata field receive_time of the data is within the time range that is specified for the batch synchronization node.
What do I do if the LogHub fields that are read based on the field mappings configured for a batch synchronization node are not the expected fields?
Manually modify the configuration of the fields in the DataWorks console.
I configured the endDateTime parameter to specify the end time for reading data from a Kafka data source, but some data that is read is generated at a time point later than the specified end time. What do I do?
- You can configure the skipExceedRecord parameter to specify whether to write such data to the destination. For more information, see Kafka Reader. To prevent data loss, we recommend that you set the skipExceedRecord parameter to false to ensure that the data generated later than the end time is not skipped.
- You can use the max.poll.records parameter in Kafka to specify the amount of data to read at a time. Configure this parameter and the number of parallel threads to control the excess data volume that is allowed. The allowed excess data volume must meet the following condition: Allowed excess data volume < value of max.poll.records × Number of parallel threads.
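For reference, the following snippet is a hedged sketch of a Kafka Reader configuration that combines the parameters mentioned above. The server address, topic, consumer group, and time values are hypothetical, and the kafkaConfig wrapper that passes max.poll.records to the Kafka consumer is an assumption that you should verify against the Kafka Reader topic.
"parameter": {
    "server": "192.168.0.1:9092",
    "topic": "my_topic",
    "beginDateTime": "20231201000000",
    "endDateTime": "20231201010000",
    "skipExceedRecord": "false",
    "kafkaConfig": {
        "group.id": "my_consumer_group",
        "max.poll.records": "500"
    }
}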
What do I do if a batch synchronization node used to synchronize data from a Kafka data source does not read data or runs for a long period of time even if only a small amount of data is stored in the Kafka data source?
- Cause:
This issue occurs because the Exit Strategy parameter of the batch synchronization node is set to Exit when reach configured end offset or time. In this case, for all partitions in which the current synchronized offset is within the time range specified by a start offset and an end offset, at least one record whose current synchronized offset is later than or the same as the specified end offset must be read before the batch synchronization node can exit. Otherwise, the batch synchronization node always tries to extract data.
- Solution:
Reduce the number of Kafka partitions from which you want to synchronize data, or write a heartbeat record to each Kafka partition at a regular interval to enable the batch synchronization node to meet exit conditions.
How do I remove the random strings that appear in the data I write to OSS?
When you use OSS Writer to write files to OSS, take note of the name prefixes of the files. OSS simulates the directory structure by adding prefixes and delimiters to file names. For example, if you set "object": "datax", the names of the generated files start with datax and end with random strings. The number of files that are generated is determined by the number of slices into which the batch synchronization node is split.
If you do not want to use a random universally unique identifier (UUID) as the suffix, we recommend that you set the writeSingleObject parameter to true. For more information, see the description of the writeSingleObject parameter in OSS Writer.
For more information about OSS Writer, see OSS Writer.
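For reference, the following snippet is a minimal sketch of an OSS Writer configuration that writes a single object instead of multiple objects with random suffixes. The data source and object names are hypothetical, and parameters other than object and writeSingleObject are assumptions that you should check against the OSS Writer topic.
"parameter": {
    "datasource": "my_oss_source",
    "object": "datax",
    "fileFormat": "text",
    "writeSingleObject": "true"
}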
How do I configure a batch synchronization node to synchronize data from tables in sharded MySQL databases to the same MaxCompute table?
You can configure the batch synchronization node based on the instructions provided in the topic for MySQL Reader. For more information, see MySQL Reader.
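For reference, the following snippet is a hedged sketch of a MySQL Reader connection configuration that merges tables from multiple sharded databases into one batch synchronization node. The data source and table names are hypothetical; all tables must have the same schema.
"parameter": {
    "column": ["order_id", "buyer_id", "amount"],
    "connection": [
        {
            "datasource": "mysql_shard_0",
            "table": ["orders_00", "orders_01"]
        },
        {
            "datasource": "mysql_shard_1",
            "table": ["orders_02", "orders_03"]
        }
    ]
}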
What do I do if a full scan for a MaxCompute table slows down data synchronization because no index is added in the WHERE clause?
- Sample scenario The following SQL statement is executed:
select bid,inviter,uid,createTime from `relatives` where createTime>='2016-10-23 00:00:00' and createTime<'2016-10-24 00:00:00';
The batch synchronization node starts to run at 11:01:24.875 on October 25, 2016 and starts to return results at 11:11:05.489 on October 25, 2016. Data Integration waits for the database to return the SQL query results, but it takes a long time for the database to respond.
- Cause
When a query is performed based on the WHERE clause, the createTime column is not indexed, which results in a full table scan.
- Solution
If you use the WHERE clause for a query, we recommend that you use an indexed column or add an index to the column that you want to scan. This can improve the data synchronization performance.
What do I do if the Chinese characters that are synchronized to a MySQL table contain garbled characters because the encoding format of the related MySQL data source is utf8mb4?
If the MySQL data source is added in connection string mode, add information about the encoding format of the MySQL data source to the JDBC URL that is used to connect to the related MySQL database. Example: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45. For information about how to add a MySQL data source, see Add a MySQL data source.
Can I use a function supported by a source to aggregate fields when I synchronize data by using an API operation? For example, can I use a function supported by a MaxCompute data source to aggregate Fields a and b in a MaxCompute table as a primary key for synchronizing data to Lindorm?
No. You cannot use functions supported by a source to aggregate fields when you synchronize data by using an API operation. To achieve this, first use a function supported by the source to process the data that you want to synchronize, and then synchronize the processed data to the destination.
Can I use only the ALTER TABLE statement to modify the time to live (TTL) of a table from which data needs to be synchronized?
Yes, you can use only the ALTER TABLE statement to modify the TTL of a table from which you want to synchronize data. This is because the TTL is configured for a table. No TTL-related settings are provided when you configure a batch synchronization node.
How do I configure Elasticsearch Reader to synchronize the properties of object fields or nested fields, such as object.field1?
"multi":{
"multi":true
}
Sample configuration:#Example:
##Data in an Elasticsearch data source
"hits": [
{
"_index": "mutiltest_1",
"_type": "_doc",
"_id": "7XAOOoMB4GR_1Dmrrust",
"_score": 1.0,
"_source": {
"level1": {
"level2": [
{
"level3": "testlevel3_1"
},
{
"level3": "testlevel3_2"
}
]
}
}
}
]
##Configuration for Elasticsearch Reader
"parameter": {
"column": [
"level1",
"level1.level2",
"level1.level2[0]"
],
"multi":{
"multi":true
}
}
##Data written to the destination: three columns of one row. The order of the columns is the same as the order that is specified in the configuration of Elasticsearch Reader.
COLUMN VALUE
level1: {"level2":[{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]}
level1.level2: [{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]
level1.level2[0]: {"level3":"testlevel3_1"}
What do I do if data of a string type in a MaxCompute data source is enclosed in double quotation marks (") after the data is synchronized to an Elasticsearch data source? How do I configure the JSON strings read from a MaxCompute data source to be written to nested fields in an Elasticsearch data source?
- Data of a string type in MaxCompute is enclosed in double quotation marks (") after the data is synchronized to an Elasticsearch data source due to the data display pattern of the Kibana console. The actual data that is synchronized to the Elasticsearch data source is not enclosed in double quotation marks ("). You can run a curl command or use Postman to query the data that is synchronized to the Elasticsearch data source. Sample curl commands:
//es7
curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_mapping'
//es6
curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_mapping'
- If you want to configure the JSON strings read from a MaxCompute data source to be written to nested fields in an Elasticsearch data source, you can set the type of the fields to which you want to write the JSON strings in the Elasticsearch data source to nested. The following example shows how to configure the name field read from a MaxCompute data source to be written to a nested field in an Elasticsearch data source.
- Data synchronization configuration: Set the type of the field to which you want to write data in the name field to nested, as shown in the sketch after this list.
- Data synchronization result: Data in the name field is written to a nested field in the Elasticsearch data source.
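The following snippet is a minimal sketch of the Elasticsearch Writer column configuration for this example. Only the name field is shown; other fields and parameters are omitted.
"column": [
    {
        "name": "name",
        "type": "nested"
    }
]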
How do I configure a batch synchronization node to synchronize data such as string "[1,2,3,4,5]"
from a data source to an Elasticsearch data source as an array?
- Configure the batch synchronization node to parse source data in the JSON format: For example, if you want to synchronize the source data [1,2,3,4,5] to an Elasticsearch data source as an array, you can configure the json_array:true setting when you configure the destination fields in the Elasticsearch data source. This way, the batch synchronization node can parse the source data as an array.
- Configuration in the codeless UI
- Configuration in the code editor
"column":[ { "name":"docs", "type":"keyword", "json_array":true } ]
- Configure the batch synchronization node to parse source data based on a delimiter: For example, if you want to synchronize the source data 1,2,3,4,5 to an Elasticsearch data source as an array, you can specify a comma (,) as a delimiter for the batch synchronization node. This way, the batch synchronization node can parse the source data as an array based on the delimiter.
- Limits:
- A batch synchronization node supports only one type of delimiter. You cannot specify different delimiters for different fields that you want to write to Elasticsearch as arrays. For example, you cannot specify a comma (,) as a delimiter for the source field col1="1,2,3,4,5" and a hyphen (-) as a delimiter for the source field col2="6-7-8-9-10".
- You can specify a regular expression as a delimiter. For example, if the value of a source field is 6-,-7-,-8+,*9-,-10, you can specify ".,." as a delimiter. The delimiter-related setting can be configured by using the codeless UI.
- Configuration in the codeless UI
The default delimiter is "-,-".
- Configuration in the code editor
"parameter" : { "column": [ { "name": "col1", "array": true, "type": "long" } ], "splitter":"," }
The property type of a field in a self-managed Elasticsearch index is keyword, but the type of the child property of the field is changed to keyword after the related batch synchronization node is run with the cleanup=true
setting configured. Why does this happen?
#Original mappings
{
"name":"box_label_ret",
"properties":{
"box_id":{
"type":"keyword"
}
}
}
#Mappings after the batch synchronization node is run with the cleanup=true setting configured
{
"box_label_ret": {
"properties": {
"box_id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}}}}
}
}
- Cause:
When Elasticsearch Writer writes data to Elasticsearch, Elasticsearch Writer writes data only to the mappings of the top layer of property. Elasticsearch automatically changes the type of the top layer of property to text and adds fields:keyword. The addition of the keyword field is an adaptive behavior of Elasticsearch, which does not affect the use of Elasticsearch Writer. If you do not want to use the mapping format provided by Elasticsearch, you can refer to Overview of the solution-based synchronization feature to select a data synchronization solution for data synchronization.
- Solution:
Configure mappings for the Elasticsearch index to which you want to write data and set the cleanup parameter to true before you run the batch synchronization node.
Each time data is written to Elasticsearch, an unauthorized request is sent, and the request fails because the username verification fails. As a result, a large number of audit logs are generated every day because all the requests are logged. What do I do?
- Cause:
Each time an HTTP client establishes communication with a data source, an unauthorized request is sent. After an authorization requirement is returned based on the specified authentication method, an authorized request is sent. Each time data is written to Elasticsearch, communication must be established between the HTTP client and the Elasticsearch data source. Therefore, each time data is written to Elasticsearch, an unauthorized request is sent. As a result, each data write request is recorded in audit logs.
- Solution:
Add the
"preemptiveAuth":true
setting for the related batch synchronization node in the code of the node.
Why do the settings that are configured for Elasticsearch Writer not take effect during the creation of an index?
- Cause:
#Incorrect configuration
"settings": {
    "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}
#Correct configuration
"settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
}
- Solutions:
- The settings take effect only when an index is created. An index can be created in one of the following scenarios: No index exists, or the cleanup parameter is set to true.
- If the cleanup parameter is set to true, remove the index field from the settings.
How do I configure a batch synchronization node to synchronize data to fields of a date data type in an Elasticsearch data source?
You can use one of the following methods to configure the related settings:
- Directly write data that is read from the source to fields of a date data type in the Elasticsearch data source:
- Configure the origin:true setting to enable the batch synchronization node to directly write data that is read from the source to the Elasticsearch data source.
- Configure the format parameter. The format parameter specifies the format properties that must be configured for mappings of fields when Elasticsearch Writer creates mappings for the fields.
"parameter" : { "column": [ { "name": "col_date", "type": "date", "format": "yyyy-MM-dd HH:mm:ss", "origin": true } ] }
- Convert the time zone. If you want to enable Data Integration to convert the time zone, configure the Timezone parameter.
"parameter" : { "column": [ { "name": "col_date", "type": "date", "format": "yyyy-MM-dd HH:mm:ss", "Timezone": "UTC" } ] }
What do I do if a write error occurs when the type of a field is set to version in the configuration of Elasticsearch Writer?
- Cause: The type of a field is set to version in the configuration of Elasticsearch Writer. Elasticsearch Writer does not support the type:version setting.
"column":[ { "name":"id", "type":"version" }, ]
- Solution:
Remove the type:version setting.
Error messages
What do I do if the error message [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots
is returned when I run a batch synchronization node to synchronize data?
- Possible cause:
The number of parallel threads for the node is set to an excessively large value, and the resources are not sufficient to run the node.
- Solution: Reduce the number of parallel threads for the node.
- If you configure the node by using the codeless UI, specify a smaller value for the Expected Maximum Concurrency parameter when you configure the channel control policies. For more information, see Configure a batch synchronization node by using the codeless UI.
- If you configure the node by using the code editor, specify a smaller value for the concurrent parameter when you configure the channel control policies. For more information, see Configure a batch synchronization node by using the code editor.
What do I do if the error message OutOfMemoryError: Java heap space
is returned when I run a batch synchronization node to synchronize data?
- If the Reader or Writer plug-in that you use supports the batchsize or maxfilesize parameter, set the batchsize or maxfilesize parameter to a smaller value.
You can check whether a Reader or Writer plug-in supports the batchsize or maxfilesize parameter. For more information, see Supported data source types, Reader plug-ins, and Writer plug-ins.
- Reduce the number of parallel threads for the node.
- If you configure the node by using the codeless UI, specify a smaller value for the Expected Maximum Concurrency parameter when you configure the channel control policies. For more information, see Configure a batch synchronization node by using the codeless UI.
- If you configure the node by using the code editor, specify a smaller value for the concurrent parameter when you configure the channel control policies. For more information, see Configure a batch synchronization node by using the code editor.
- If you synchronize data from files, such as Object Storage Service (OSS) objects, reduce the number of files from which you want to read data.
What do I do if the error message AccessDenied The bucket you access does not belong to you.
is returned when I run a batch synchronization node to synchronize data from OSS?
The account whose AccessKey pair is configured for the OSS data source does not have permissions to access the specified bucket. Grant the account permissions to access the bucket.
What do I do if the error message Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!
is returned for storing data written to Redis in hash mode?
- Cause:
If you want to store data in Redis in hash mode, make sure that attributes and values are generated in pairs. For example, you configure odpsReader: "column":[ "id", "name", "age", "address" ] for the source and RedisWriter: "keyIndexes":[ 0, 1 ] for Redis. In this case, id and name are used as keys, age is used as an attribute, and address is used as a value in Redis. If the source is a MaxCompute data source and only two columns are configured, you cannot store the source data to the Redis cache in hash mode. If you store the source data to the Redis cache in hash mode, an error is reported.
- Solution:
If you want to synchronize only two columns, store the data in Redis in string mode. If you want to store data in hash mode, configure at least three columns in the source.
What do I do if the error message org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery
is returned when I run a batch synchronization node to synchronize data from PostgreSQL?
- Problem description: The error message
org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery
is displayed when I run a batch synchronization node to synchronize data from PostgreSQL.
- Cause: The system takes a long time to obtain data from the related PostgreSQL database.
- Solution: Specify larger values for the max_standby_archive_delay and max_standby_streaming_delay parameters. For more information, see Standby Server Events.
What do I do if the error message Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout' on the server
is returned when I run a batch synchronization node to read data from or write data to ApsaraDB RDS for MySQL?
- Causes:
- If the error message contains the net_read_timeout parameter, the execution time of an SQL statement exceeds the maximum execution time allowed by ApsaraDB RDS for MySQL. The SQL statement is one of the multiple SQL statements obtained after a single data acquisition SQL statement is equally split based on the setting of the shard key when you run the batch synchronization node to read data from the ApsaraDB RDS for MySQL data source.
- If the error message contains the net_write_timeout parameter, the timeout period in which the system waits for a block to be written to a data source is too small.
- Solution:
Add the net_write_timeout or net_read_timeout parameter to the URL that is used to connect to the ApsaraDB RDS for MySQL database and set the parameter to a larger value, such as jdbc:mysql://192.168.1.1:3306/lizi?useUnicode=true&characterEncoding=UTF8&net_write_timeout=72000. You can also adjust the value of the net_write_timeout or net_read_timeout parameter in the ApsaraDB RDS console.
- Suggestion:
If the batch synchronization node can be rerun, configure the related settings to enable the system to automatically rerun the node.

What do I do if the error message [DBUtilErrorCode-05]ErrorMessage: Code:[DBUtilErrorCode-05]Description:[Failed to write data to the specified table.]. - com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed
is returned when I run a batch synchronization node to synchronize data to a MySQL data source?
Cause:
The batch synchronization node still tries to read data from the source after the period of time indicated by the default value of the wait_timeout
parameter is exceeded. The default value of the wait_timeout parameter is 28800 seconds (8 hours).
Solution:
Modify the configuration file my.cnf of the MySQL server. Set the wait_timeout and interactive_timeout parameters in the configuration file to 2592000 seconds. Then, restart the MySQL server, and execute the show variables like '%wait_time%' statement on the MySQL server to check whether the settings take effect. If you use the Windows operating system, the name of the configuration file of the MySQL server is my.ini.
What do I do if the error message The last packet successfully received from the server was 902,138 milliseconds ago
is returned when I run a batch synchronization node to read data from a MySQL data source?
If the CPU utilization is normal but the memory usage is high, the data source may be disconnected from Data Integration.
If the batch synchronization node can be automatically rerun, we recommend that you configure the related settings to enable the system to automatically rerun the node after an error occurs. For more information, see Configure time properties.
What do I do if the error message Communications link failure
is returned when I run a batch synchronization node to read data from or write data to a MySQL data source?
- Read data from a MySQL data source:
- Problem description:
The following error message is returned when I run a batch synchronization node to read data from a MySQL data source:
Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago. The last packet sent successfully to the server was 7,200,100 milliseconds ago. - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
- Possible cause:
Slow SQL queries in the related MySQL database result in timeout when you read data from the MySQL data source.
- Solutions:
- Check whether the WHERE clause is specified to ensure that an index is added for the filter field.
- Check whether a large amount of data exists in the source tables. If a large amount of data exists in the source tables, we recommend that you run multiple nodes to read data.
- Check the database logs to find the SQL queries that are blocked and contact the database administrator to resolve the issue.
- Write data to a MySQL data source:
- Problem description:
The following error message is returned when I write data to a MySQL data source:
Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago. The last packet sent successfully to the server was 12,013 milliseconds ago. More...
- Possible cause:
A socket timeout occurred due to slow SQL queries. The default value of the SocketTimeout parameter of Taobao Distributed Data Layer (TDDL) connections is 12 seconds. If the execution time of an SQL statement on a MySQL data source exceeds 12 seconds, a TDDL-4614 error is reported. This error occasionally occurs when the data volume is large or the server is busy.
- Solutions:
- Rerun the batch synchronization node after the database becomes stable.
- Contact the database administrator to adjust the value of the SocketTimeout parameter.
What do I do if the error message Duplicate entry 'xxx' for key 'uk_uk_op'
is returned when I run a batch synchronization node to synchronize data?
- Problem description: The error message
Error updating database. Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'cfc68cd0048101467588e97e83ffd7a8-0' for key 'uk_uk_op'
is returned when I run a batch synchronization node to synchronize data.
- Cause: In Data Integration, different instances of the same data synchronization node cannot be run at the same time. Therefore, multiple synchronization instances that are configured based on the same JSON settings cannot be run at the same time. For a data synchronization node whose instances are run at 5-minute intervals, the instance that is scheduled to run at 00:00 and the instance that is scheduled to run at 00:05 are both run at 00:05 due to the delay of the ancestor node of the data synchronization node. As a result, one of the instances fails to be run. This issue may occur if you backfill data for or rerun a data synchronization node that is running.
- Solution: Adjust the time at which the instances are run and make sure that the instances are run at different points in time. We recommend that you configure instances generated for nodes that are scheduled by hour to depend on instances generated for the same nodes in the previous cycle. For more information, see Scenario 2: Configure scheduling dependencies for a node that depends on last-cycle instances.
What do I do if the error message plugin xx does not specify column
is returned when I run a batch synchronization node to synchronize data?
A possible cause is that field mappings are not correctly configured or the column parameter is not correctly configured for the related Reader or Writer plug-in.
- Check whether the mappings between the source fields and destination fields are correctly configured.
- Check whether the column parameter is correctly configured for the related Reader or Writer plug-in.
What do I do if the error message The download session is expired
is returned when I run a batch synchronization node to read data from a MaxCompute table?
- Problem description:
The following error message is returned when I read data from a MaxCompute table:
Code:DATAX_R_ODPS_005:Failed to read data from a MaxCompute table, Solution:[Contact the administrator of MaxCompute]. RequestId=202012091137444331f60b08cda1d9, ErrorCode=StatusConflict, ErrorMessage=The download session is expired
- Possible cause:
If you want to read data from a MaxCompute table, you must run a Tunnel command in MaxCompute to upload or download data. The lifecycle for each Tunnel session on the server spans 24 hours after the session is created. If a batch synchronization node is run for more than 24 hours, the node fails and exits. For more information about the Tunnel service, see Usage notes.
- Solution:
Appropriately increase the number of parallel threads for the node and plan the amount of data that you want to synchronize to ensure that the data can be synchronized within 24 hours.
What do I do if the error message Error writing request body to server
is returned when I run a batch synchronization node to write data to a MaxCompute table?
- Problem description:
The following error message is returned when I write data to a MaxCompute table:
Code:[OdpsWriter-09], Description:[Failed to write data to the destination MaxCompute table.]. - Failed to write Block 0 to the destination MaxCompute table, uploadId=[202012081517026537dc0b0160354b]. Contact the administrator of MaxCompute. - java.io.IOException: Error writing request body to server
- Possible causes:
- Possible cause 1: The data type is incorrect. The data type of source data does not comply with the data type specifications of MaxCompute. For example, the value 4.2223 cannot be written to the destination MaxCompute table in the format of DECIMAL(precision,scale), such as DECIMAL(18,10).
- Possible cause 2: The MaxCompute block is abnormal, or the communication is abnormal.
- Solution:
Convert the data type of the data that is to be synchronized into a data type that is supported by the destination.
What do I do if the JSON data returned based on the path:[] condition is not of the ARRAY type when I use RestAPI Writer to write data?

What do I do if the error message Host is blocked
is returned when I run a batch synchronization node to synchronize data from an Amazon RDS data source?
Disable the health check for load balancing feature for the Amazon RDS data source.
What do I do if the error message no master
is returned when I run a batch synchronization node to synchronize data from a MongoDB data source?
DataWorks does not support data synchronization from a secondary database. If you configure a secondary database as the source of a batch synchronization node, this error is returned.
What do I do if the error message MongoExecutionTimeoutException: operation exceeded time limit
is returned when I run a batch synchronization node to synchronize data from a MongoDB data source?
- Cause:
This error occurs due to cursor timeout.
- Solution:
Specify a larger value for the cursorTimeoutInMs parameter.
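For reference, the following snippet is a minimal sketch of how the cursorTimeoutInMs parameter can be increased in the MongoDB Reader configuration. The data source and collection names are hypothetical, and the value is only an example.
"parameter": {
    "datasource": "my_mongodb_source",
    "collectionName": "my_collection",
    "cursorTimeoutInMs": 3600000
}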
What do I do if the error message DataXException: operation exceeded time limit
is returned when I run a batch synchronization node to synchronize data from a MongoDB data source?
Increase the number of parallel threads and the number of data records that can be read at a time for the batch synchronization node.
What do I do if the error message ERROR ESReaderUtil - ES_MISSING_DATE_FORMAT, Unknown date value. please add "dataFormat". sample value:
is returned when I run a batch synchronization node to synchronize data from an Elasticsearch data source?
- Cause:
Elasticsearch Reader cannot parse the date format of fields of a date data type because the format properties are not configured for the mappings of some fields of a date data type in the Elasticsearch data source.
- Solutions:
- Configure the dateFormat parameter for Elasticsearch Reader in the same format as the format properties that are configured for the mappings of the fields of a date data type in the Elasticsearch data source. The date formats that are specified in the dateFormat parameter must be separated by double vertical bars (||). The format properties must include the formats of the date data types that you use. Sample code:
"parameter" : { "column": [ { "name": "col1", "type": "date" } ], "dateFormat" : "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss", }
- Configure format properties for the mappings of all fields of a date data type in the Elasticsearch data source.
What do I do if the error message com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[Configuration file contains invalid information. Check the configuration.] - Invalid configuration information. Configuration information is not of the JSON format: illegal ref, [ . Provide configuration information in the JSON format
is returned when I run a batch synchronization node to synchronize data from an Elasticsearch data source?
- Cause:
Due to the limits of the keywords supported by Fastjson, this error may be returned if an Elasticsearch index name or field name contains a keyword such as $ref.
- Solution:
Elasticsearch Reader does not support synchronization of data from Elasticsearch indexes that contain fields with the keyword $ref in field names. For more information, see Elasticsearch Reader.
What do I do if the error message version_conflict_engine_exception
is returned when I run a batch synchronization node to synchronize data to an Elasticsearch data source?
- Cause:
The optimistic locking mechanism of Elasticsearch is triggered due to a version conflict. The version conflict occurs because the current version number in the Elasticsearch data source is different from the version number that is specified in an update command. This typically happens when data in the index is deleted while the data is being updated.
- Solutions:
- Check whether a data deletion operation is performed.
- Change the type of action that can be performed when data is written to the Elasticsearch data source from update to indexing.
What do I do if the error message illegal_argument_exception
is returned when I run a batch synchronization node to synchronize data to an Elasticsearch data source?
- Cause: The other_params parameter is not configured for fields for which advanced properties, such as similarity and properties, are configured. As a result, Elasticsearch Writer cannot identify the advanced properties that are configured.
- Solution: Configure the other_params parameter for the fields and configure the advanced property similarity in the other_params parameter. Example:
{"name":"dim2_name",...,"other_params":{"similarity":"len_similarity"}}
What do I do if the error message dense_vector
is returned when I run a batch synchronization node to synchronize data from fields of an array data type in a MaxCompute data source to an Elasticsearch data source?
- Cause: Batch synchronization nodes cannot write data to fields of the dense_vector type in an Elasticsearch data source. Only the following data types are supported:
ID,PARENT,ROUTING,VERSION,STRING,TEXT,KEYWORD,LONG, INTEGER,SHORT,BYTE,DOUBLE,FLOAT,DATE,BOOLEAN,BINARY, INTEGER_RANGE,FLOAT_RANGE,LONG_RANGE,DOUBLE_RANGE,DATE_RANGE, GEO_POINT,GEO_SHAPE,IP,IP_RANGE,COMPLETION,TOKEN_COUNT,OBJECT,NESTED
- Solutions:
- Use self-managed mappings. We recommend that you do not use mappings that are automatically created by Elasticsearch Writer for the related Elasticsearch index.
- Change the data type of the fields of an array data type to NESTED.
- Configure the following settings: dynamic=true and cleanup=false.