
DataWorks: Batch synchronization

Last Updated: Mar 19, 2024

This topic provides answers to some frequently asked questions about batch synchronization.

Overview

The frequently asked questions in this topic fall into the following categories:

  • O&M of batch synchronization tasks

  • Errors caused by configurations for batch synchronization tasks

  • Errors for specific plug-ins

  • Scenarios of batch synchronization

  • Error messages

Error messages

  • Error message returned for batch synchronization of data: [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots.

  • Error message returned for batch synchronization of data: OutOfMemoryError: Java heap space

  • Error message returned for batch synchronization of data from OSS: AccessDenied The bucket you access does not belong to you.

  • Error message returned for storing data written to Redis in hash mode: Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!

  • Error message returned for batch synchronization of data from PostgreSQL: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Error message returned for batch synchronization of data from or to MySQL: Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout' on the server

  • Error message returned for batch synchronization of data from MySQL: The last packet successfully received from the server was 902,138 milliseconds ago

  • Error message returned for batch synchronization of data to MySQL: [DBUtilErrorCode-05]ErrorMessage: Code:[DBUtilErrorCode-05]Description:[Failed to write data to the specified table.]. - com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed

  • Error message returned for batch synchronization of data: Communications link failure

  • Error message returned for batch synchronization of data: Duplicate entry 'xxx' for key 'uk_uk_op'

  • Error message returned for batch synchronization of data: plugin xx does not specify column

  • Error message returned for batch synchronization of data from a MaxCompute table: The download session is expired.

  • Error message returned for batch synchronization of data to a MaxCompute table: Error writing request body to server

  • Error message returned for using RestAPI Writer to write data: JSON data returned based on the path:[] condition is not of the ARRAY type

  • Error message returned for batch synchronization of data from Amazon Relational Database Service (Amazon RDS): Host is blocked

  • Error message returned for batch synchronization of data from MongoDB: no master

  • Error message returned for batch synchronization of data from MongoDB: MongoExecutionTimeoutException: operation exceeded time limit

  • Error message returned for using Elasticsearch Reader to read data: ERROR ESReaderUtil - ES_MISSING_DATE_FORMAT, Unknown date value. please add "dataFormat". sample value:

  • Error message returned for using Elasticsearch Reader to read data: com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[Configuration file contains invalid information. Check the configuration.] - Invalid configuration information. Configuration information is not of the JSON format: illegal ref, [ . Provide configuration information in the JSON format.

  • Error message returned for batch synchronization of data to Elasticsearch: version_conflict_engine_exception.

  • Error message returned for batch synchronization of data to Elasticsearch: illegal_argument_exception.

  • Error message returned for batch synchronization of fields of an array data type from MaxCompute to Elasticsearch: dense_vector.

  • Error message returned for batch synchronization of data to an on-premises Hive data source: Could not get block locations.

  • Error message returned for batch synchronization of data from MongoDB: no such cmd splitVector

  • Error message returned for batch synchronization of data to MongoDB: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2"

  • Error message returned for a batch synchronization task: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

Why is the network connectivity test of a data source successful, but the batch synchronization task that uses the data source fails to be run?

  • Even if the data source has passed the network connectivity test before, test the connectivity again to confirm that the resource group that you use is still connected to the data source and that no changes have been made to the data source.

  • Check whether the resource group that is connected to the data source is the same as the resource group that you use to run the batch synchronization task.

    Check the resource group that is used to run a synchronization task:

    • If the synchronization task is run on the shared resource group for Data Integration, the log generated for the synchronization task contains the following information: running in Pipeline[basecommon_group_xxxxxxxxx].

    • If the synchronization task is run on a custom resource group for Data Integration, the log generated for the synchronization task contains the following information: running in Pipeline[basecommon_xxxxxxxxx].

    • If the synchronization task is run on an exclusive resource group for Data Integration, the log generated for the synchronization task contains the following information: running in Pipeline[basecommon_S_res_group_xxx].

  • If the synchronization task that is scheduled to run in the early morning occasionally fails but reruns successfully, check the load of the data source at the time when the failure occurred.

How do I change the resource group that is used to run a batch synchronization task of Data Integration?

You can go to DataStudio to change the resource group that is used to debug a batch synchronization task, or go to Operation Center to change the resource group that is used to schedule and run a batch synchronization task. For more information, see Create and use an exclusive resource group for Data Integration.

How do I locate and handle dirty data?

Definition: If an exception occurs when a single data record is written to the destination, the data record is considered dirty data. Therefore, any data record that fails to be written to the destination is considered dirty data.

Impact: Dirty data is not written to the destination. You can control whether dirty data is allowed and the maximum number of dirty data records that are allowed. By default, Data Integration allows the generation of dirty data. When you configure a batch synchronization task, you can specify the maximum number of dirty data records that are allowed. For more information, see Configure a batch synchronization task by using the codeless UI.

  • Dirty data is allowed for a batch synchronization task: If a dirty data record is generated, the batch synchronization task continues to run. However, the dirty data record is discarded and is not written to the destination.

  • Take note of the following items when you specify the maximum number of dirty data records that are allowed for a batch synchronization task:

    • If you set the maximum number of dirty data records that are allowed to 0, the batch synchronization task fails and exits when a dirty data record is generated.

    • If you set the maximum number of dirty data records that are allowed to x, the batch synchronization task fails and exits when the number of dirty data records that are generated exceeds x. The batch synchronization task continues to run if the number of dirty data records that are generated is less than x. However, the dirty data records are discarded and are not written to the destination.
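
If you configure the batch synchronization task in the code editor, the upper limit for dirty data records corresponds to the errorLimit setting. The following minimal sketch sets the limit to 0, which makes the task fail and exit as soon as one dirty data record is generated:

"setting": {
    "errorLimit": {
        "record": "0"
    }
}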

Analysis of dirty data generated during data synchronization:

  • Scenario 1:

    • Problem description: The following error message is returned: {"message":"Dirty data is found in the data that is to be written to the destination MaxCompute table: The [third] field contains dirty data. Check and correct the data, or increase the threshold value for the number of data dirty records and ignore this dirty data record.","record":[{"byteSize":0,"index":0,"type":"DATE"},{"byteSize":0,"index":1,"type":"DATE"},{"byteSize":1,"index":2,"rawData":0,"type":"LONG"},{"byteSize":0,"index":3,"type":"STRING"},{"byteSize":1,"index":4,"rawData":0,"type":"LONG"},{"byteSize":0,"index":5,"type":"STRING"},{"byteSize":0,"index":6,"type":"STRING"}.

    • Solutions: The logs show that the third field contains dirty data. You can identify the cause of dirty data based on the following scenarios:

      • If the error for the dirty data is reported by a writer, check the table creation statement in the writer. Make sure that the data size of the specified field in the destination MaxCompute table is less than the data size of the same field in the source MySQL table.

      • The following requirements must be met before you can synchronize data from the source to the destination: 1. The data types of source fields must match the data types of destination fields. For example, data of the VARCHAR type in source fields cannot be written to the destination fields that are used to store data of the INT type. 2. The size of data defined by the data types of destination fields must be sufficient to receive the data in the mapped fields in the source. For example, you can synchronize data of the LONG, VARCHAR, or DOUBLE type from the source to the fields that are used to store data of a string or text type.

      • If the dirty data error is not clear, copy and print out dirty data records, observe the data, and then compare the data type of the data records with the data type of destination fields to identify dirty data.

      Example:

      {"byteSize":28,"index":25,"rawData":"ohOM71vdGKqXOqtmtriUs5QqJsf4","type":"STRING"}

      byteSize: the number of bytes. index: 25 indicates the 26th field. rawData: the specific value. type: the data type.

  • Scenario 2:

    • Problem description: An error for dirty data is reported when data that is read from a MySQL data source by DataX contains null values.

    • Solution: Check the data type of the destination fields to which the source fields that contain null values are written. If the data type of the source fields is different from the data type of the destination fields, an error is reported. For example, if data in STRING-type source fields that contain null values is written to INT-type destination fields, an error is reported.

How do I view dirty data?

Go to the log details page of a batch synchronization task and click the link next to Detail log url to view run logs generated for the synchronization task and dirty data information.

Is all data not synchronized if the number of dirty data records generated during data synchronization exceeds the specified upper limit?

If the number of dirty data records generated during data synchronization exceeds the specified upper limit, the batch synchronization task stops running. The data that was synchronized to the destination before the task stopped is retained. If you set the upper limit to N, at most N dirty data records are allowed during data synchronization; if the limit is exceeded, the task stops running. For example, if you set the upper limit to 0, the task stops running when the first dirty data record is detected. In this case, a small number of data records may already have been synchronized to the destination before the task stops. If the first data record is a dirty data record, no data records are synchronized.

What do I do if a batch synchronization task runs for an extended period of time?

Possible cause 1: The SQL statement that is executed before or after data synchronization is time-consuming, or the shard key is not appropriately configured.

  • The SQL statement that you configure to execute before or after data synchronization takes an extended period of time to execute in databases. This prolongs the period that is required to run the batch synchronization task.

  • The shard key is not appropriately configured. As a result, the batch synchronization task is run at a slow speed.

    A batch synchronization task shards data based on the shard key that you configure. Then, Data Integration uses parallel threads to synchronize the data that is sharded. This improves the efficiency of data synchronization. For information about whether a shard key is required by a plug-in, see the related topic for the plug-in.

Solution 1:

  • When you configure the SQL statement to be executed before or after data synchronization, we recommend that you use indexed fields to filter data.

  • If the related plug-in requires a shard key, we recommend that you appropriately configure a shard key. For example, when you configure a shard key for MySQL Reader, take note of the following items:

    • We recommend that you specify the name of the primary key column of a source table as a shard key. This way, data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards.

    • A shard key can be used to shard only data of an integer data type. If you specify a shard key column of an unsupported data type, only a single thread is used to read data.

    • If no shard key is specified, only a single thread is used to read data.
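
For reference, the shard key corresponds to the splitPk parameter when you configure MySQL Reader in the code editor. The following is a minimal sketch; the data source, table, and column names are illustrative:

"parameter": {
    "column": ["id", "name", "gmt_create"],
    "splitPk": "id",
    "connection": [
        {
            "table": ["orders"],
            "datasource": "mysql_example"
        }
    ]
}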

Possible cause 2: The batch synchronization task waits for resources.

Solution 2: If the number of parallel threads on the related exclusive resource group for Data Integration reaches the upper limit, the logs show that the synchronization task waits for resources for an extended period of time. For more information about the specific causes and solutions, see Why does a data synchronization task in Data Integration wait for resources for an extended period of time?

Note

DataWorks uses resource groups for scheduling to issue batch synchronization tasks to resource groups for Data Integration and runs the synchronization tasks on the resource groups for Data Integration. Therefore, a batch synchronization task also occupies the resources of a resource group for scheduling. If the resources are occupied for a long period of time, other batch synchronization tasks and tasks of other types may be blocked.

How do I handle a dirty data error that is caused by encoding format configuration issues or garbled characters?

  • Problem description:

    If data contains emoticons, a dirty data error message similar to the following error message may be returned during data synchronization: [13350975-0-0-writer] ERROR StdoutPluginCollector - Dirty data {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"}],"type":"writer"}.

  • Possible causes:

    • utf8mb4 is not configured for a data source. As a result, an error is reported when data that contains emoticons is synchronized.

    • Data in the source contains garbled characters.

    • The encoding format of a data source is different from the encoding format of the client.

    • The encoding format of the browser is different from the encoding format of a data source or the client. As a result, the preview fails or the previewed data contains garbled characters.

  • Solutions:

    The solutions vary based on the cause.

    • If data in the source contains garbled characters, process the data before you run the batch synchronization task.

    • If the encoding format of the data source is different from the encoding format of the client, modify the configuration for the encoding format of the data source and make sure that the encoding format of the data source is the same as the encoding format of the client.

    • If the encoding format of the browser is different from the encoding format of the data source or client, modify the configuration for the encoding format of the browser and make sure that the encoding format is the same as the encoding format of the data source or client. Then, preview the data.

    You can perform the following operations:

    1. If your data source is added in connection string mode and you want to add the configuration for the encoding format utf8mb4 for the data source, append the configuration to the related JDBC URL. Example: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    2. If your data source is added in Alibaba Cloud instance mode and you want to add the configuration for the encoding format utf8mb4 for the data source, append the configuration to the name of the related database. Example: database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    3. Change the encoding format of the data source to utf8mb4. For example, you can change the encoding format of an ApsaraDB RDS database in the ApsaraDB RDS console.

      Note

      You can run the following command to set the encoding format of the ApsaraDB RDS data source to utf8mb4: set names utf8mb4. You can run the following command to view the encoding format of an ApsaraDB RDS data source: show variables like 'char%'.

What do I do if an SSRF attack is detected in a batch synchronization task and the error message Task have SSRF attacks is returned?

If a private IP address is specified when the data source is added to DataWorks and the data source is used for a batch synchronization task, you cannot run the synchronization task on the shared resource group for Data Integration. You must run the synchronization task on an exclusive resource group for Data Integration. For more information about exclusive resource groups for Data Integration, see Create and use an exclusive resource group for Data Integration.

What do I do if a batch synchronization task occasionally fails to be run?

If a batch synchronization task occasionally fails to be run, a possible cause is that the IP address whitelist of a data source used for the synchronization task is incompletely configured.

  • If the synchronization task is run on an exclusive resource group for Data Integration, configure the IP address whitelist based on the following instructions:

    • If you have added the IP address of the elastic network interface (ENI) of the exclusive resource group for Data Integration to the IP address whitelist of the data source and the resource group is scaled out, you must add the ENI IP address of the resource group after the scale-out to the IP address whitelist.

    • We recommend that you directly add the CIDR block of the vSwitch with which the exclusive resource group for Data Integration is associated to the IP address whitelist of the data source. Otherwise, each time the resource group is scaled out, you must update the ENI IP address in the IP address whitelist of the data source. For more information, see Configure an IP address whitelist.

  • If the synchronization task is run on the shared resource group for Data Integration, configure the IP address whitelist based on the following instructions:

    Make sure that IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides are added to the IP address whitelist of the data source. For more information, see Add the IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides to an IP address whitelist of a data source.

If the configuration of the IP address whitelist of the data source is complete, check whether the connection between the data source and Data Integration is interrupted due to the heavy load of the data source.

What do I do if a field is added to or updated in a source table of a batch synchronization task?

Go to the configuration tab of the batch synchronization task and update the mappings between the fields in the source table and the fields in the destination table. Then, commit and run the batch synchronization task for the update to take effect.

What do I do if an error occurs when I use the root user to add a MongoDB data source?

Change the username. You must use the username of the database to which the table you want to synchronize belongs when you add a MongoDB data source. The username of the root user is not supported.

For example, if you want to synchronize data from the name table in the test database, use the username of the test database.

How do I convert the values of the variables in the query parameter into values of the TIMESTAMP data type when I synchronize incremental data from a table of a MongoDB database?

Use assignment nodes to convert data of the DATE type into data of the TIMESTAMP type and use the timestamp value as an input parameter for data synchronization from MongoDB. For more information, see How do I synchronize incremental data of the TIMESTAMP type from a table of a MongoDB database?.

After data is synchronized from a MongoDB data source to a destination, the time zone of the data is 8 hours ahead of the original time zone of the data. What do I do?

Check the configurations of the cursorTimeoutInMs and query parameters. For information about the parameters, see MongoDB Reader.

What do I do if a batch synchronization task fails to synchronize data changes in a MongoDB data source to a destination?

Restart the synchronization task after a period of time without changing the configuration of the query parameter.

Is the number of OSS objects from which OSS Reader can read data limited?

In Data Integration, the number of OSS objects from which OSS Reader can read data is not limited. The maximum number of OSS objects from which data can be read is determined by the JVM parameters that are configured for a batch synchronization task. To prevent out of memory (OOM) errors, we recommend that you do not use an asterisk (*) to specify the objects from which you want to read data.
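
For reference, the following is a minimal sketch of an OSS Reader configuration that lists the objects explicitly instead of using an asterisk (*); the endpoint, bucket, and object names are illustrative:

"parameter": {
    "endpoint": "http://oss-cn-shanghai.aliyuncs.com",
    "bucket": "example-bucket",
    "object": [
        "data/orders_20240301.csv",
        "data/orders_20240302.csv"
    ],
    "fieldDelimiter": ","
}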

What do I do if data fails to be written to DataHub because the amount of data that I want to write at a time exceeds the upper limit?

  • Problem description:

    The following error message is returned when I run a batch synchronization task to write data to DataHub: ERROR JobContainer - Exception when job runcom.alibaba.datax.common.exception.DataXException: Code:[DatahubWriter-04], Description:[Failed to write data to DataHub.]. - com.aliyun.datahub.exception.DatahubServiceException: Record count 12498 exceed max limit 10000 (Status Code: 413; Error Code: TooLargePayload; Request ID: 20201201004200a945df0bf8e11a42).

  • Possible cause:

    The amount of data that you want to write to DataHub at a time exceeds the upper limit that is allowed by DataHub. The following parameters specify the maximum amount of data that can be written to DataHub at a time:

    • maxCommitSize: specifies the maximum amount of the buffered data that Data Integration can accumulate before it transfers the data to the destination. Unit: MB. The default value is 1 MB, which is 1,048,576 bytes.

    • batchSize: specifies the maximum number of the buffered data records that a single batch synchronization task can accumulate before it transfers the data records to the destination.

  • Solution:

    Set the maxCommitSize and batchSize parameters to smaller values.
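
    The following is a minimal sketch of the adjusted parameters in the DataHub Writer configuration in the code editor; the data source and topic names and the specific values are illustrative:

    "parameter": {
        "datasource": "datahub_example",
        "topic": "example_topic",
        "batchSize": 1024,
        "maxCommitSize": 524288
    }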

Is historical data replaced each time data is written to Lindorm in bulk mode provided by Lindorm?

Yes, data in the same column and row is replaced, and the other data remains unchanged. The logic is the same as that when an API operation is used to write data to Lindorm.

How do I query all fields in an index of an Elasticsearch cluster?

Use a curl command to query the mappings of an Elasticsearch index and extract all fields from the mappings.

  • Sample curl commands that can be used:

    //es7
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_mapping'
    //es6
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_mapping'
  • Sample result that is returned:

    {
        "indexname": {
            "mappings": {
                "typename": {
                    "properties": {
                        "field1": {
                            "type": "text"
                        },
                        "field2": {
                            "type": "long"
                        },
                        "field3": {
                            "type": "double"
                        }
                    }
                }
            }
        }
    }

    In the preceding result, all fields in the index and the properties of the fields are contained in the properties parameter. In this example, the index contains field1, field2, and field3.

How do I configure a batch synchronization task to synchronize data from indexes with dynamic names in an Elasticsearch data source to another data source?

When you configure indexes from which you want to read data, you can use date-related variables. This way, the system can calculate index-related strings and enable Elasticsearch Reader to read data from indexes with dynamic names. To configure and run such a batch synchronization task, perform the following steps: Define date-related variables, configure the defined variables in the code to specify indexes, and then deploy and run the synchronization task.

  1. Define date-related variables: When you configure scheduling settings for the batch synchronization task on the configuration tab of the synchronization task by using the codeless user interface (UI), you must assign scheduling parameters to variables to define the variables. In the following figure, var1 indicates the date of the scheduling time of the batch synchronization task, and var2 indicates the data timestamp of the batch synchronization task, which is one day before the date of the scheduling time.

  2. Configure the defined variables in the code to specify indexes: Switch from the codeless UI to the code editor. Configure the defined variables in the code of the batch synchronization task in the ${Variable name} format to specify the indexes from which you want Elasticsearch Reader to read data, as shown in the following figure.

  3. Deploy and run the batch synchronization task: Test the batch synchronization task, and commit and deploy the synchronization task to Operation Center if the test result meets your expectations. You can configure time properties for the batch synchronization task to enable the system to periodically run the synchronization task, or use the data backfill feature to run the synchronization task.

    1. Click the Run with Parameters icon in the top toolbar of the configuration tab of the batch synchronization task to test the synchronization task. After you click the icon to test the batch synchronization task, the system replaces the scheduling parameters configured for the task with actual values. After the test is complete, you can view the run logs of the batch synchronization task to check whether the indexes from which data is read meet your expectations.

      Note

      You must enter values for the replacement test after you click the Run with Parameters icon.

    2. If the test result meets your expectations, click the Save and Submit icons in the top toolbar of the configuration tab of the batch synchronization task in sequence to commit the synchronization task to the production environment.

      If your workspace is in standard mode, click Deploy to deploy the batch synchronization task to the production environment.

  4. View the running result of the batch synchronization task.

    Configuration that is used to specify indexes in the code: "index": "esstress_1_${var1}_${var2}"

    Replacement result: esstress_1_20230106_20230105

How do I specify table names in the configurations of a batch synchronization task?

If you want to synchronize data from the tables that are named in a consistent format, such as the tables that are named by date and have the same table schemas (orders_20170310, orders_20170311, and orders_20170312), specify table names by using the scheduling parameters when you configure the batch synchronization task in the code editor. This way, the batch synchronization task automatically reads table data generated on the previous day from the source every morning. For more information about how to configure a batch synchronization task by using the code editor, see Configure a batch synchronization task by using the code editor.

For example, if the current day is March 15, 2017, the batch synchronization task can automatically read data of the orders_20170314 table from the source.

In the code editor, use a variable to specify the name of a source table, such as orders_${tablename}. The tables are named by date. If you want the batch synchronization task to read data generated on the previous day from the source every day, assign the value ${yyyymmdd} to the ${tablename} variable in the parameter configurations of the task.
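
The following is a minimal sketch of the corresponding reader configuration in the code editor, assuming a MySQL source; the data source and column names are illustrative:

"parameter": {
    "column": ["order_id", "amount"],
    "connection": [
        {
            "table": ["orders_${tablename}"],
            "datasource": "mysql_example"
        }
    ]
}

The ${tablename} variable is then assigned the value ${yyyymmdd} in the parameter configurations of the task, as described above.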

Note

For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

What do I do if the table that I want to select is not displayed when I configure a batch synchronization task by using the codeless UI?

When you configure a batch synchronization task by using the codeless UI, the system displays only the first 25 tables in the selected source by default. If the selected source contains more than 25 tables and the table that you want to select is not displayed, you can search for the table by table name, or switch to the code editor and configure the batch synchronization task by using the code editor.

What are the items that I must take note of when I add a source field in a batch synchronization task that synchronizes data from a MaxCompute table?

  1. Constants are supported. Each constant must be enclosed in single quotation marks ('), such as 'abc' and '123'.

  2. You can add scheduling parameters, such as '${bizdate}'. For information about how to use scheduling parameters, see Supported formats of scheduling parameters.

  3. You can specify the partition key columns from which you want to read data, such as the partition key column pt.

  4. If the field that you entered cannot be parsed, the value of the Type parameter for the field is displayed as Custom.

  5. MaxCompute functions are not supported.

  6. If the value of Type for the fields that you manually added, such as the partition key columns of MaxCompute tables or fields that cannot be previewed in LogHub, is Custom, the batch synchronization task can still be run.
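
The following is a minimal sketch of a column configuration that combines the preceding items, assuming the code editor is used; the field names are illustrative:

"column": [
    "id",
    "name",
    "'abc'",
    "'${bizdate}'",
    "pt"
]

In this sketch, 'abc' is a constant, '${bizdate}' is a scheduling parameter, and pt is a partition key column.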

How do I configure a batch synchronization task to read data from partition key columns in a MaxCompute table?

Add a source field when you configure mappings between source fields and destination fields, and specify the name of the partition key column, such as pt.

How do I configure a batch synchronization task to read data from multiple partitions in a MaxCompute table?

Locate the partitions from which you want to read data.

  • You can use Linux Shell wildcards to specify the partitions. An asterisk (*) indicates zero or multiple characters, and a question mark (?) indicates a single character.

  • The partitions that you specify must exist in the source table. Otherwise, the system reports an error for the batch synchronization task. If you want the batch synchronization task to be successfully run even if the partitions that you specify do not exist in the source table, use the code editor to modify the code of the synchronization task and add "successOnNoPartition": true to the configurations of MaxCompute Reader.

For example, the partitioned table test contains four partitions: pt=1,ds=hangzhou, pt=1,ds=shanghai, pt=2,ds=hangzhou, and pt=2,ds=beijing. In this case, you can make the configurations based on the following instructions:

  • To read data from the partition pt=1,ds=hangzhou, specify "partition":"pt=1,ds=hangzhou".

  • To read data from all the ds partitions in the pt=1 partition, specify "partition":"pt=1,ds=*".

  • To read data from all the partitions in the test table, specify "partition":"pt=*,ds=*".

You can also perform the following operations in the code editor to specify other conditions based on which data can be read from the partitions:

  • To read data from the partition that stores the largest amount of data, add /*query*/ ds=(select MAX(ds) from DataXODPSReaderPPR) to the configuration of MaxCompute Reader.

  • To filter data based on filter conditions, add /*query*/ pt+Expression to the configuration of MaxCompute Reader. For example, /*query*/ pt>=20170101 and pt<20170110 indicates that you want to read the data that is generated from January 1, 2017 to January 9, 2017 from all the pt partitions in the test table.

Note

MaxCompute Reader processes the content after /*query*/ as a WHERE clause.
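
Putting the preceding options together, the following is a minimal sketch of a MaxCompute Reader configuration in the code editor; the table name, column names, and partition values are illustrative:

"parameter": {
    "table": "test",
    "partition": ["pt=1,ds=*"],
    "successOnNoPartition": true,
    "column": ["id", "name"]
}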

What do I do if a batch synchronization task fails to be run because the name of a column in the source table is a keyword?

  • Cause: The value of the column parameter contains reserved fields or fields whose names start with a number.

  • Solution: Use the code editor to configure the batch synchronization task and escape special fields in the configuration of the column parameter. For information about how to configure a batch synchronization task in the code editor, see Configure a batch synchronization task by using the code editor.

    • MySQL uses grave accents (`) as escape characters to escape keywords in the following format: `Keyword`.

    • Oracle and PostgreSQL use double quotation marks (") as escape characters to escape keywords in the following format: "Keyword".

    • SQL Server uses brackets [] as escape characters to escape keywords in the following format: [Keyword].

    For an example of a MySQL escape character in the column parameter, see the sketch at the end of this section.

  • In this example, a MySQL data source is used.

    1. Execute the following statement to create a table named aliyun, which contains a column named table: create table aliyun (`table` int ,msg varchar(10));

    2. Execute the following statement to create a view and assign an alias to the table column: create view v_aliyun as select `table` as col1,msg as col2 from aliyun;

      Note
      • MySQL uses table as a keyword. If the name of a column in the source table is table, an error is reported during data synchronization. In this case, you must create a view to assign an alias to the table column.

      • We recommend that you do not use a keyword as the name of a column.

    3. Execute the preceding statement to assign an alias to the column whose name is a keyword. When you configure a batch synchronization task for the MySQL data source, use the v_aliyun view to replace the aliyun table.
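
As an alternative to the view-based workaround, the following is a minimal sketch of the column parameter for the aliyun table with the keyword escaped, assuming MySQL Reader is configured in the code editor:

"column": [
    "`table`",
    "msg"
]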

Why is no data obtained when I run a batch synchronization task to synchronize data from a LogHub data source whose fields contain values?

Field names are case-sensitive for LogHub Reader. Check the field name configuration for LogHub Reader.

Why is some data missed when I run a batch synchronization task to read data from a LogHub data source?

In Data Integration, a batch synchronization task reads data from a LogHub data source at the time when the data is transferred to LogHub. Log on to the LogHub console and check whether the value of the metadata field receive_time, which is configured for reading data, is within the time range specified for the batch synchronization task.

What do I do if the LogHub fields that are read based on the field mappings configured for a batch synchronization task are not the expected fields?

Manually modify the configuration of the fields in the DataWorks console.

I configured the endDateTime parameter to specify the end time for reading data from a Kafka data source, but some data that is read is generated at a point in time later than the specified end time. What do I do?

Kafka Reader reads data from a Kafka data source in batches. If data that is generated later than the time specified by endDateTime is found in a batch of read data, Kafka Reader stops reading data. However, the data generated later than the end time is also written to the destination.

  • You can configure the skipExceedRecord parameter to specify whether to synchronize such data to the destination. For more information, see Kafka Reader. To prevent data loss, we recommend that you set the skipExceedRecord parameter to false to ensure that the data generated later than the end time is not skipped.

  • You can use the max.poll.records parameter in Kafka to specify the amount of data to read at a time. Configure this parameter and the number of parallel threads to control the excess data amount that is allowed. The allowed excess data amount must meet the following condition: Allowed excess data amount < Value of max.poll.records × Number of parallel threads.
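
For reference, the following is a minimal sketch of the related Kafka Reader settings in the code editor; the server address, topic, group ID, and time values are illustrative, and the exact set of supported parameters may vary by version:

"parameter": {
    "server": "192.168.0.1:9092",
    "topic": "example_topic",
    "beginDateTime": "20240301000000",
    "endDateTime": "20240302000000",
    "skipExceedRecord": "false",
    "kafkaConfig": {
        "group.id": "example_group",
        "max.poll.records": "100"
    }
}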

What do I do if a batch synchronization task used to synchronize data from a Kafka data source does not read data or runs for a long period of time even if only a small amount of data is stored in the Kafka data source?

  • Cause:

    This issue occurs because the Exit Strategy parameter of the batch synchronization task is set to "Exit when reach configured end offset or time". With this setting, for every partition whose current synchronized offset is within the time range specified by the start offset and end offset, at least one record whose offset is later than or the same as the specified end offset must be read before the batch synchronization task can exit. Otherwise, the task keeps trying to extract data.

  • Solution:

    Reduce the number of Kafka partitions from which you want to synchronize data, or write a heartbeat record to each Kafka partition at a regular interval to enable the batch synchronization task to meet exit conditions.

How do I remove the random strings that appear in the data I write to OSS?

When you use OSS Writer to write files to OSS, take note of the name prefixes of the files. OSS simulates a directory structure by adding prefixes and delimiters to file names, such as "object": "datax". This way, the names of the files start with datax and end with random strings. The number of slices into which a batch synchronization task is split determines the number of files that are generated.

If you do not want to use a random universally unique identifier (UUID) as the suffix, we recommend that you set the writeSingleObject parameter to true. For more information, see the description of the writeSingleObject parameter in OSS Writer.

For more information about OSS Writer, see OSS Writer.
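
The following is a minimal sketch of the relevant OSS Writer settings; the bucket and object names are illustrative:

"parameter": {
    "bucket": "example-bucket",
    "object": "datax/result.csv",
    "writeSingleObject": true
}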

How do I configure a batch synchronization task to synchronize data from tables in sharded MySQL databases to the same MaxCompute table?

You can configure the batch synchronization task based on the instructions provided in the topic for MySQL Reader. For more information, see MySQL Reader.

What do I do if a full scan for a MaxCompute table slows down data synchronization because no index is added in the WHERE clause?

  • Sample scenario

    The following SQL statement is executed:

    select bid,inviter,uid,createTime from `relatives` where createTime>='2016-10-23 00:00:00' and createTime<'2016-10-24 00:00:00';

    The batch synchronization task starts to run at 11:01:24.875 on October 25, 2016 and starts to return results at 11:11:05.489 on October 25, 2016. Data Integration is waiting for the database to return SQL query results. However, it takes a long period of time before MaxCompute can respond.

  • Cause

    When a query is performed based on the WHERE clause, the createTime column is not indexed, which results in a full table scan.

  • Solution

    If you use the WHERE clause for a query, we recommend that you use an indexed column or add an index to the column that you want to scan. This can improve the data synchronization performance.

What do I do if the Chinese characters that are synchronized to a MySQL table contain garbled characters because the encoding format of the related MySQL data source is utf8mb4?

If the MySQL data source is added in connection string mode, add information of the encoding format of the MySQL data source to the JDBC URL that is used to connect to the related MySQL database. Example: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45. For information about how to add a MySQL data source, see Add a MySQL data source.

Can I use a function supported by a source to aggregate fields when I synchronize data by using an API operation? For example, can I use a function supported by a MaxCompute data source to aggregate Fields a and b in a MaxCompute table as a primary key for synchronizing data to Lindorm?

No, functions supported by a source cannot be used to aggregate fields when you synchronize data by using an API operation. You must use a function supported by the source to process the data that you want to synchronize and then synchronize the data to the related destination.

Can I use only the ALTER TABLE statement to modify the time to live (TTL) of a table from which data needs to be synchronized?

Yes, you can use only the ALTER TABLE statement to modify the TTL of a table from which you want to synchronize data. This is because the TTL is configured for a table. No TTL-related settings are provided when you configure a batch synchronization task.

How do I configure Elasticsearch Reader to synchronize the properties of object fields or nested fields, such as object.field1?

If you want to synchronize the properties of object fields, you can configure the related batch synchronization task by using only the code editor. When you configure the batch synchronization task by using the code editor, configure the following setting for the multi parameter and configure the column parameter in the Property.Child property format.

"multi":{
   "multi":true 
 }

Sample configuration:

#Example:
##Data in an Elasticsearch data source
"hits": [
    {
        "_index": "mutiltest_1",
        "_type": "_doc",
        "_id": "7XAOOoMB4GR_1Dmrrust",
        "_score": 1.0,
        "_source": {
            "level1": {
                "level2": [
                    {
                        "level3": "testlevel3_1"
                    },
                    {
                        "level3": "testlevel3_2"
                    }
                ]
            }
        }
    }
]
##Configuration for Elasticsearch Reader
"parameter": {
  "column": [
      "level1",
      "level1.level2",
      "level1.level2[0]"
  ],
  "multi":{
        "multi":true
    }
}
##Data written to the destination: three columns of one row. The order of the columns is the same as the order that is specified in the configuration of Elasticsearch Reader. 
COLUMN              VALUE
level1:             {"level2":[{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]}
level1.level2:      [{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]
level1.level2[0]:   {"level3":"testlevel3_1"}

What do I do if data of a string type in a MaxCompute data source is enclosed in double quotation marks (") after the data is synchronized to an Elasticsearch data source? How do I configure the JSON strings read from a MaxCompute data source to be written to nested fields in an Elasticsearch data source?

  1. Data of a string type in MaxCompute is enclosed in double quotation marks (") after the data is synchronized to an Elasticsearch data source due to the data display pattern of the Kibana console. The actual data that is synchronized to the Elasticsearch data source is not enclosed in double quotation marks ("). You can run a curl command or use Postman to query the data that is synchronized to the Elasticsearch data source. Sample curl commands:

    //es7
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/_search'
    //es6
    curl -u username:password --request GET 'http://esxxx.elasticsearch.aliyuncs.com:9200/indexname/typename/_search'

  2. If you want to configure the JSON strings read from a MaxCompute data source to be written to nested fields in an Elasticsearch data source, you can set the type of the fields to which you want to write the JSON strings in the Elasticsearch data source to nested. The following example shows how to configure the name field read from a MaxCompute data source to be written to a nested field in an Elasticsearch data source.

    • Data synchronization configuration: Set the type of the field to which you want to write data in the name field to nested.

    • Data synchronization result: Data in the name field is written to a nested field in the Elasticsearch data source.

How do I configure a batch synchronization task to synchronize data such as string "[1,2,3,4,5]" from a data source to an Elasticsearch data source as an array?

If you want a batch synchronization task to synchronize data from a data source to an Elasticsearch data source as arrays, you can configure the synchronization task based on the format of the source data by using one of the following methods:

  • Configure the batch synchronization task to parse source data in the JSON format: For example, if you want to synchronize the source data [1,2,3,4,5] to an Elasticsearch data source as an array, you can configure the json_array:true setting when you configure the destination fields in the Elasticsearch data source. This way, the batch synchronization task can parse the source data as an array.

    • Configuration in the codeless UI

    • Configuration in the code editor

      "column":[
        {
          "name":"docs",
          "type":"keyword",
          "json_array":true
        }
      ]
  • Configure the batch synchronization task to parse source data based on a delimiter: For example, if you want to synchronize the source data 1,2,3,4,5 to an Elasticsearch data source as an array, you can specify a comma (,) as a delimiter for the batch synchronization task. This way, the batch synchronization task can parse the source data as an array based on the delimiter.

    • Limits:

      • A batch synchronization task supports only one type of delimiter. You cannot specify different delimiters for different fields that you want to write to Elasticsearch as arrays. For example, you cannot specify a comma (,) as a delimiter for the source field col1="1,2,3,4,5" and a hyphen (-) as a delimiter for the source field col2="6-7-8-9-10".

      • You can specify a regular expression as a delimiter. For example, if the value of a source field is 6-,-7-,-8+,*9-,-10, you can specify ".,." as a delimiter. The delimiter-related setting can be configured by using the codeless UI.

    • The default delimiter is "-,-".

    • Configuration in the code editor

      "parameter" : {
            "column": [
              {
                "name": "col1",
                "array": true,
                "type": "long"
              }
            ],
            "splitter":","
      }

The property type of a field in a self-managed Elasticsearch index is keyword, but the type of the field is changed to text with a keyword child property after the related batch synchronization task is run with the cleanup=true setting configured. Why does this happen?

#Original mappings
{
  "name":"box_label_ret",
  "properties":{
    "box_id":{
      "type":"keyword"
    }
  }
}
#Mappings after the batch synchronization task is run with the cleanup=true setting configured
{
    "box_label_ret": {
        "properties": {
            "box_id": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    }
}
  • Cause:

    When Elasticsearch Writer writes data to Elasticsearch, it writes mappings only for the top-level property. Elasticsearch then automatically changes the type of the top-level property to text and adds fields:keyword. The addition of the keyword field is an adaptive behavior of Elasticsearch and does not affect the use of Elasticsearch Writer. If you do not want to use the mapping format provided by Elasticsearch, you can refer to Overview of the full and incremental synchronization feature to perform the related operation.

  • Solution:

    Configure mappings for the Elasticsearch index to which you want to write data and set the cleanup parameter to false before you run the batch synchronization task.

Each time data is written to Elasticsearch, an unauthorized request is sent, and the request fails because the username verification fails. As a result, a large number of audit logs are generated every day because all the requests are logged. What do I do?

  • Cause:

    Each time an HTTP client establishes communication with a data source, an unauthorized request is sent. After an authorization requirement is returned based on the specified authentication method, an authorized request is sent. Each time data is written to Elasticsearch, communication must be established between the HTTP client and the Elasticsearch data source. Therefore, each time data is written to Elasticsearch, an unauthorized request is sent. As a result, each data write request is recorded in audit logs.

  • Solution:

    Add the "preemptiveAuth":true setting for the related batch synchronization task in the code of the task.

Why do the settings that are configured for Elasticsearch Writer not take effect during the creation of an index?

  • Cause:

    #Incorrect configuration
    "settings": {
      "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }
    #Correct configuration
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  • Solution:

    The settings take effect only when an index is created. An index is created only if no index exists or if the cleanup parameter is set to true. If you set the cleanup parameter to true, remove the index level from the settings, as shown in the correct configuration.

How do I configure a batch synchronization task to synchronize data to fields of a date data type in an Elasticsearch data source?

You can use one of the following methods to configure the related settings:

  • Directly write data that is read from the source to fields of a date data type in the Elasticsearch data source:

    • Configure the origin:true setting to enable the batch synchronization task to directly write data that is read from the source to the Elasticsearch data source.

    • Configure the format parameter. The format parameter specifies the format properties that must be configured for mappings of fields when Elasticsearch Writer creates mappings for the fields.

      "parameter" : {
          "column": [
              {
                  "name": "col_date",
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss",
                  "origin": true
              }
                ]
      }
  • Convert the time zone. If you want to enable Data Integration to convert the time zone, configure the Timezone parameter.

    "parameter" : {
        "column": [
            {
                "name": "col_date",
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss",
                "Timezone": "UTC"
            }
              ]
    }

What do I do if a write error occurs when the type of a field is set to version in the configuration of Elasticsearch Writer?

  • Cause: The type of a field is set to version in the configuration of Elasticsearch Writer. Elasticsearch Writer does not support the type:version setting.

        "column":[
                                {
                                    "name":"id",
                                    "type":"version"
                                },
      ]
  • Solution:

    Remove the type:version setting.

What do I do if an error message is displayed when I preview field mappings of an unstructured data source?

  • Problem description:

    After Preview Data is clicked, an error message indicating that the number of bytes of the fields exceeds the upper limit is displayed, as shown in the following figure.

    (Figure: error message indicating that the number of bytes of the fields exceeds the upper limit)

  • Cause: To prevent OOM errors, the service to which a data source belongs checks the length of fields when the service processes a data preview request. If the number of bytes of fields in a column exceeds 1,000, the error message shown in the preceding figure is displayed. This error message does not affect the running of the related batch synchronization task. You can ignore the error message and run the batch synchronization task.

    Note

    If data is read from an existing file and the network connectivity is normal, the following reasons may lead to a data preview failure:

    • The data in a single row of the file exceeds the upper limit of 10 MB. In this case, data cannot be displayed, and an error message that is similar to the error message shown in the preceding figure is displayed.

    • The number of columns in a single row of the file exceeds the upper limit of 1,000. In this case, only the first 1,000 columns can be displayed, and an error message is displayed in column 1001.

Is MongoDB Reader case-sensitive to the names of fields from which I want to synchronize data?

Yes, MongoDB Reader is case-sensitive to the names of fields from which you want to synchronize data. When you configure the fields, you must make sure that the field names you specify are the same as the field names in the MongoDB database. Otherwise, MongoDB Reader obtains null values instead of the actual values of the fields. The following code provides an example:

  • Field name in the MongoDB database:

    {
        "MY_NAME": "zhangsan"
    }
  • Field name that is specified in the configuration of the related batch synchronization task:

    {
        "column":
        [
            {
                "name": "my_name"
            }
        ]
    }

In the preceding code, the field name that is specified in the configuration of the batch synchronization task is different from the field name in the MongoDB database. In this case, an error occurs when MongoDB Reader reads data from the MongoDB database.

How do I configure a timeout period for MongoDB Reader?

You can configure the cursorTimeoutInMs parameter for MongoDB Reader. This parameter specifies the maximum period of time that can be used by the MongoDB server to execute a query. This time limit does not include the period of time that is required to transfer data. The default value of this parameter is 600000 milliseconds (10 minutes). If the total amount of data that you want to read is excessively large, the following error message may be returned: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.
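
The following is a minimal sketch of where the parameter goes in the MongoDB Reader configuration; the data source, collection, and field names are illustrative:

"parameter": {
    "datasource": "mongodb_example",
    "collectionName": "example_collection",
    "cursorTimeoutInMs": 3600000,
    "column": [
        {
            "name": "MY_NAME",
            "type": "string"
        }
    ]
}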

How do I perform operations such as filtering fields, reordering fields, and setting empty fields to null when I run a batch synchronization task to synchronize data to MaxCompute?

To perform the operations, you can configure specific settings for MaxCompute Writer. MaxCompute Writer allows you to perform operations that are not supported by MaxCompute, such as filtering fields, reordering fields, and setting empty fields to null. If you want to write data to all fields in a destination MaxCompute table, you can set the column parameter to ["*"] when you configure MaxCompute Writer by using the code editor.

For example, a destination MaxCompute table contains three fields: a, b, and c. If you want to write data only to Fields c and b, you can set the column parameter to ["c","b"] when you configure MaxCompute Writer by using the code editor. Data in the first field and the second field in the mapped source table is written to Fields c and b in the destination MaxCompute table. During data synchronization, Field a is automatically set to null.
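
The following is a minimal sketch of the corresponding MaxCompute Writer configuration in the code editor; the project, table, and partition values are illustrative:

"parameter": {
    "project": "example_project",
    "table": "example_table",
    "partition": "pt=${bizdate}",
    "column": ["c", "b"]
}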

What do I do if the number of fields that I want to write to a destination MaxCompute table is greater than the number of fields in the table?

To prevent data loss caused by redundant fields and ensure data quality and data reliability, MaxCompute Writer returns an error message if the number of fields to be written is greater than the number of fields in the destination table. For example, if a MaxCompute table contains Fields a, b, and c, MaxCompute Writer returns an error message if more than three fields are to be written to the table.

What are the precautions that I must take when I configure partition information for a destination MaxCompute table?

MaxCompute Writer can write data to the lowest-level partitions but cannot route data to partitions based on a field. For example, if you want to write data to a MaxCompute table that contains three levels of partitions, you must specify a third-level partition, such as pt=20150101, type=1, biz=2. The data cannot be written if you specify pt=20150101, type=1 or pt=20150101.
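
For example, a code-editor configuration that writes to a third-level partition can look similar to the following sketch. This sketch assumes that the partition value is specified through the partition parameter of MaxCompute Writer; replace the partition values with the ones that match your table.

    "parameter": {
        "partition": "pt=20150101,type=1,biz=2"
    }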

How do I ensure the idempotence of data that is written to MaxCompute in task rerun and failover scenarios?

You can configure "truncate": true for MaxCompute Writer to ensure the idempotence of data that is written to MaxCompute when you configure the related batch synchronization task by using the code editor. If a failed batch synchronization task is rerun due to a write failure, MaxCompute Writer deletes the data that has been written to the destination MaxCompute table and writes the source data to the table again. This ensures that the data written in each rerun is the same. If a batch synchronization task is interrupted due to other exceptions, the idempotence of data cannot be ensured, the data cannot be rolled back, and the synchronization task cannot be automatically rerun. In this case, to ensure the idempotence of write operations and the data integrity, you can set the truncate parameter to true for MaxCompute Writer when you configure the batch synchronization task by using the code editor.

Note

If you set the truncate parameter to true for MaxCompute Writer, all data in the specified partition or table is deleted before a rerun. Make sure that setting the truncate parameter to true does not affect your business before you make this setting.
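
The following is a minimal code-editor sketch of the truncate setting in the MaxCompute Writer configuration. The partition value is only a placeholder for this example; all other writer parameters are omitted.

    "parameter": {
        "truncate": true,
        "partition": "pt=${bizdate}"
    }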

Does the system retain information of a source table, such as not-null properties and the default values of fields of the source table, in the mapped destination table that is automatically created?

When the system automatically creates a destination table, it retains only information such as the field names, data types, and comments of the mapped source table. It does not retain the default values of fields or the constraints of the source table, such as not-null constraints and index constraints.

How do I configure Tablestore Writer to write data to a destination table that contains auto-increment primary key columns?

  1. Add the following settings to the configurations of Tablestore Writer:

    "newVersion": "true",
    "enableAutoIncrement": "true",
  2. Do not specify the names of auto-increment primary key columns in the configurations of Tablestore Writer.

  3. Make sure that the sum of the number of primary keys specified in the primaryKey parameter and the number of columns specified in the column parameter in the configurations of Tablestore Writer is the same as the number of columns that are specified in the configurations of the related reader. A minimal configuration sketch follows this list.
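
The following is a minimal sketch that combines the settings in the preceding steps. The primary key and column names are placeholders, and the exact entry format of the primaryKey and column parameters may vary with your Tablestore Writer version; adapt the sketch to your table schema. In this sketch, one primary key plus two columns corresponds to three columns configured for the related reader, and the auto-increment primary key column is intentionally not listed.

    "parameter": {
        "newVersion": "true",
        "enableAutoIncrement": "true",
        "primaryKey": [
            {"name": "pk1", "type": "string"}
        ],
        "column": [
            {"name": "col1", "type": "string"},
            {"name": "col2", "type": "int"}
        ]
    }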

How do I use _tags and is_timeseries_tag in the configurations of a time series model to read or write data?

Example: A data record has three tags: phone=xiaomi,RAM=8G,camera=LEICA.

  • Example for using Tablestore Reader to read data

    • If you want to combine the preceding tags and export the tags as one column, configure the following setting:

      "column": [
            {
              "name": "_tags",
            }
          ],

      DataWorks exports the tags as one column. Example:

      ["phone=xiaomi","camera=LEICA","RAM=8G"]
    • If you want to read the phone tag and the camera tag as two separate columns, configure the following setting:

      "column": [
            {
              "name": "phone",
              "is_timeseries_tag":"true",
            },
            {
              "name": "camera",
              "is_timeseries_tag":"true",
            }
          ],

      The following two columns of data are obtained:

      xiaomi, LEICA
  • Example for using Tablestore Writer to write data

    For example, the following two columns of data exist in the source:

    • ["phone=xiaomi","camera=LEICA","RAM=8G"]

    • 6499

    To write both columns of data to the _tags field in the destination, configure the following setting:

    "column": [
          {
            "name": "_tags",
          },
          {
            "name": "price",
            "is_timeseries_tag":"true",
          },
        ],
    • The configuration for the first column enables ["phone=xiaomi","camera=LEICA","RAM=8G"] to be written to the tag field.

    • The configuration for the second column enables price=6499 to be written to the tag field.

How do I retain the millisecond part or a specified custom date or time format when I use a batch synchronization task to synchronize data of the DATE or TIME type to text files?

Switch from the codeless UI to the code editor, and add the following configuration in the setting parameter in the code of the synchronization task.

"common": {
  "column": {
    "dateFormat": "yyyyMMdd",
    "datetimeFormatInNanos": "yyyyMMdd HH:mm:ss.SSS"
  }
}

  • dateFormat specifies the date format that is used when data of the DATE type in the source is written to text files. Do not include the hour, minute, and second parts in the date format when you configure this parameter.

  • datetimeFormatInNanos specifies the date and time format that is used when data of the DATETIME or TIMESTAMP type in the source is written to text files. Include the hour, minute, and second parts in the date and time format when you configure this parameter. The date and time format can be accurate to the millisecond.

Error messages

What do I do if the error message [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots is returned when I run a batch synchronization task to synchronize data?

  • Possible cause:

    The number of parallel threads for the synchronization task is set to an excessively large value, and the resources are not sufficient to run the synchronization task.

  • Solution:

    Reduce the number of parallel threads for the synchronization task.

What do I do if the error message OutOfMemoryError: Java heap space is returned when I run a batch synchronization task to synchronize data?

Solutions:

  1. If the reader or writer that you use supports the batchsize or maxfilesize parameter, set the batchsize or maxfilesize parameter to a smaller value.

    You can check whether a reader or writer supports the batchsize or maxfilesize parameter. For more information, see Supported data source types, readers, and writers.

  2. Reduce the number of parallel threads for the synchronization task.

  3. If you synchronize data from files, such as Object Storage Service (OSS) objects, reduce the number of files from which you want to read data.

What do I do if the error message AccessDenied The bucket you access does not belong to you is returned when I run a batch synchronization task to synchronize data from OSS?

The account whose AccessKey pair is configured for the OSS data source does not have permissions to access the specified bucket. Grant the account the required permissions on the bucket.

What do I do if the error message Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid! is returned for storing data written to Redis in hash mode?

  • Cause:

    If you want to store data in Redis in hash mode, attributes and values must be generated in pairs. For example, the source (odpsReader) is configured with "column": ["id", "name", "age", "address"], and Redis Writer (RedisWriter) is configured with "keyIndexes": [0, 1]. In this case, id and name are combined into the Redis key, age is used as the attribute, and address is used as the value. If the MaxCompute source provides only two fields, a key, an attribute, and a value cannot all be formed, and the data cannot be stored in Redis in hash mode. If you try to do so, an error is reported.

  • Solution:

    If you want to synchronize data from only two source fields, store the data in Redis in string mode. If you want to store data in hash mode, specify at least three source fields, as shown in the sketch after this list.
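
The following sketch restates this example in configuration form. The field names are the ones used in the example above.

  • Reader (odpsReader) configuration:

        "column": ["id", "name", "age", "address"]

  • Redis Writer (RedisWriter) configuration; the first two fields form the Redis key, and the remaining fields form the attribute-value pair:

        "keyIndexes": [0, 1]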

What do I do if the error message org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery is returned when I run a batch synchronization task to synchronize data from PostgreSQL?

  • Problem description: The error message org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery is returned when I run a batch synchronization task to synchronize data from PostgreSQL.

  • Cause: This error occurs because the system takes a long period of time to obtain data from the related PostgreSQL database. To resolve this issue, specify larger values for the max_standby_archive_delay and max_standby_streaming_delay parameters. For more information, see Standby Server Events.

What do I do if the error message Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout、' on the server is returned when I run a batch synchronization task to read data from or write data to ApsaraDB RDS for MySQL?

  • Causes:

    • If the error message contains the net_read_timeout parameter, the execution time of an SQL statement exceeded the maximum execution time allowed by ApsaraDB RDS for MySQL. When the batch synchronization task reads data from the ApsaraDB RDS for MySQL data source, the data acquisition SQL statement is evenly split into multiple SQL statements based on the shard key, and the execution time of one of these statements exceeded the limit.

    • If the error message contains the net_write_timeout parameter, the timeout period in which the system waits for a block to be written to a data source is too small.

  • Solution:

    Add the net_write_timeout or net_read_timeout parameter to the URL that is used to connect to the ApsaraDB RDS for MySQL database and set the parameter to a larger value, such as jdbc:mysql://192.168.1.1:3306/lizi?useUnicode=true&characterEncoding=UTF8&net_write_timeout=72000. You can also adjust the value of the net_write_timeout or net_read_timeout parameter in the ApsaraDB RDS console.

  • Suggestion:

    If the batch synchronization task can be rerun, configure the related settings to enable the system to automatically rerun the synchronization task.

What do I do if the error message [DBUtilErrorCode-05]ErrorMessage: Code:[DBUtilErrorCode-05]Description:[Failed to write data to the specified table.]. - com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed is returned when I run a batch synchronization task to synchronize data to a MySQL data source?

Cause:

The batch synchronization task takes longer than the period that is specified by the wait_timeout parameter of the MySQL server to read data from the source. As a result, the MySQL server closes the idle connection, and subsequent write operations fail. The default value of the wait_timeout parameter is 28800 seconds (8 hours).

Solution:

Modify the my.cnf configuration file of the MySQL server (on Windows, the file is named my.ini). Set the wait_timeout and interactive_timeout parameters in the configuration file to 2592000 seconds, restart the MySQL server, and then execute the show variables like '%wait_time%' statement on the MySQL server to check whether the settings take effect.
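
A minimal sketch of the relevant section of my.cnf (or my.ini on Windows) follows. Add or update only these two lines and keep the rest of your existing configuration unchanged.

    [mysqld]
    wait_timeout = 2592000
    interactive_timeout = 2592000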

What do I do if the error message The last packet successfully received from the server was 902,138 milliseconds ago is returned when I run a batch synchronization task to read data from a MySQL data source?

If the CPU utilization is normal but the memory usage is high, the data source may be disconnected from Data Integration.

If the batch synchronization task can be automatically rerun, we recommend that you configure the related settings to enable the system to automatically rerun the synchronization task after an error occurs. For more information, see Configure time properties.

What do I do if the error message Communications link failure is returned when I run a batch synchronization task to read data from or write data to a MySQL data source?

  • Read data from a MySQL data source:

    • Problem description:

      The following error message is returned when I run a batch synchronization task to read data from a MySQL data source: Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago. The last packet sent successfully to the server was 7,200,100 milliseconds ago. - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure.

    • Possible cause:

      Slow SQL queries in the related MySQL database result in timeout when you read data from the MySQL data source.

    • Solutions:

      • Check whether the WHERE clause is specified to ensure that an index is added for the filter field.

      • Check whether a large amount of data exists in the source tables. If a large amount of data exists in the source tables, we recommend that you run multiple synchronization tasks to read data.

      • Check the database logs to find the SQL queries that are blocked and contact the database administrator to resolve the issue.

  • Write data to a MySQL data source:

    • Problem description:

      The following error message is returned when I run a batch synchronization task to write data to a MySQL data source: Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago. The last packet sent successfully to the server was 12,013 milliseconds ago. More....

    • Possible cause:

      A socket timeout occurred due to slow SQL queries. The default value of the SocketTimeout parameter of Taobao Distributed Data Layer (TDDL) connections is 12 seconds. If the execution time of an SQL statement on a MySQL data source exceeds 12 seconds, a TDDL-4614 error is reported. This error occasionally occurs when the data volume is large or the server is busy.

    • Solutions:

      • Rerun the batch synchronization task after the database becomes stable.

      • Contact the database administrator to adjust the value of the SocketTimeout parameter.

What do I do if the error message Duplicate entry 'xxx' for key 'uk_uk_op' is returned when I run a batch synchronization task to synchronize data?

  • Problem description: The error message Error updating database. Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'cfc68cd0048101467588e97e83ffd7a8-0' for key 'uk_uk_op' is returned when I run a batch synchronization task to synchronize data.

  • Cause: In Data Integration, different instances of the same data synchronization task cannot be run at the same time. Therefore, multiple synchronization instances that are configured based on the same JSON settings cannot be run at the same time. For a data synchronization task whose instances are run at 5-minute intervals, the instance that is scheduled to run at 00:00 and the instance that is scheduled to run at 00:05 are both run at 00:05 due to the delay of the ancestor task of the data synchronization task. As a result, one of the instances fails to be run. This issue may occur if you backfill data for or rerun a data synchronization task that is running.

  • Solution: Adjust the time at which the instances are run and make sure that the instances are run at different points in time. We recommend that you configure the self-dependency for the instances generated for tasks that are scheduled by hour. For more information, see Scenario 2: Configure scheduling dependencies for a task that depends on last-cycle instances.

What do I do if the error message plugin xx does not specify column is returned when I run a batch synchronization task to synchronize data?

A possible cause is that field mappings are not correctly configured or fields are not correctly configured for the related reader or writer.

  1. Check whether the mappings between the source fields and destination fields are correctly configured.

  2. Check whether the fields are correctly configured for the related reader or writer.

What do I do if the error message The download session is expired is returned when I run a batch synchronization task to read data from a MaxCompute table?

  • Problem description:

    The following error message is returned when I read data from a MaxCompute table: Code:DATAX_R_ODPS_005:Failed to read data from a MaxCompute table, Solution:[Contact the administrator of MaxCompute]. RequestId=202012091137444331f60b08cda1d9, ErrorCode=StatusConflict, ErrorMessage=The download session is expired.

  • Possible cause:

    If you want to read data from a MaxCompute table, you must run a Tunnel command in MaxCompute to upload or download data. The lifecycle for each Tunnel session on the server spans 24 hours after the session is created. If a batch synchronization task is run for more than 24 hours, the synchronization task fails and exits. For more information about the Tunnel service, see Usage notes.

  • Solution:

    Appropriately increase the number of parallel threads for the synchronization task and plan the amount of data that you want to synchronize to ensure that the data can be synchronized within 24 hours.

What do I do if the error message Error writing request body to server is returned when I run a batch synchronization task to write data to a MaxCompute table?

  • Problem description:

    The following error message is returned when I write data to a MaxCompute table: Code:[OdpsWriter-09], Description:[Failed to write data to the destination MaxCompute table.]. - Failed to write Block 0 to the destination MaxCompute table, uploadId=[202012081517026537dc0b0160354b]. Contact the administrator of MaxCompute. - java.io.IOException: Error writing request body to server.

  • Possible causes:

    • Possible cause 1: The data type is incorrect. The data type of source data does not comply with the data type specifications of MaxCompute. For example, the value 4.2223 cannot be written to the destination MaxCompute table in the format of DECIMAL(precision,scale), such as DECIMAL(18,10).

    • Possible cause 2: The MaxCompute block is abnormal, or the communication is abnormal.

  • Solution:

    Convert the data type of the data that is to be synchronized into a data type that is supported by the destination.

What do I do if the error message JSON data returned based on the path:[] condition is not of the ARRAY type is returned when I use RestAPI Writer to write data?

RestAPI Writer provides the dataMode parameter that specifies the write mode. The dataMode parameter can be set to oneData or multiData. If you want to use RestAPI Writer to write multiple data records at a time, set dataMode to multiData. If you set this parameter to multiData, you must also add the dataPath:"data.list" configuration to the script of RestAPI Writer. For more information about RestAPI Writer, see RestAPI Writer.

Important

Do not prefix a column name with data.list when you configure the column parameter.
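
The following is a minimal sketch of the relevant part of the RestAPI Writer script. Only the dataMode and dataPath settings are shown; the remaining writer parameters must come from your actual task configuration.

    "parameter": {
        "dataMode": "multiData",
        "dataPath": "data.list"
    }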

What do I do if the error message Host is blocked is returned when I run a batch synchronization task to synchronize data from an Amazon RDS data source?

Disable the load balancing health check feature for the Amazon RDS data source.

What do I do if the error message no master is returned when I run a batch synchronization task to synchronize data from a MongoDB data source?

DataWorks does not support data synchronization from a secondary instance. If you configure a secondary instance as the source of a batch synchronization task, this error is reported.

What do I do if the error message MongoExecutionTimeoutException: operation exceeded time limit is returned when I run a batch synchronization task to synchronize data from a MongoDB data source?

  • Cause:

    This error occurs due to cursor timeout.

  • Solution:

    Specify a larger value for the cursorTimeoutInMs parameter.

What do I do if the error message DataXException: operation exceeded time limit is returned when I run a batch synchronization task to synchronize data from a MongoDB data source?

Increase the number of parallel threads and the number of data records that can be read at a time for the batch synchronization task.

What do I do if the error message ERROR ESReaderUtil - ES_MISSING_DATE_FORMAT, Unknown date value. please add "dataFormat". sample value: is returned when I run a batch synchronization task to synchronize data from an Elasticsearch data source?

  • Cause:

    Elasticsearch Reader cannot parse the date format of fields of a date data type because the format properties are not configured for the mappings of some fields of a date data type in the Elasticsearch data source.

  • Solutions:

    • Configure the dateFormat parameter for Elasticsearch Reader in the same format as the format properties that are configured for the mappings of the fields of a date data type in the Elasticsearch data source. The date formats that are specified in the dateFormat parameter must be separated by double vertical bars (||). The format properties must include the formats of the date data types that you use. Sample code:

      "parameter" : {
            "column": [
           			"dateCol1",
              	"dateCol2",
                "otherCol"
            ],
           "dateFormat" : "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss",
      }
    • Configure format properties for the mappings of all fields of a date data type in the Elasticsearch data source.

What do I do if the error message com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[Configuration file contains invalid information. Check the configuration.] - Invalid configuration information. Configuration information is not of the JSON format: illegal ref, [ . Provide configuration information in the JSON format is returned when I run a batch synchronization task to synchronize data from an Elasticsearch data source?

  • Cause:

    Due to the limits on keywords supported by Fastjson, keywords such as $ref in an Elasticsearch index or field name cannot be parsed. As a result, this error is returned.

  • Solution:

    Elasticsearch Reader does not support synchronization of data from Elasticsearch indexes that contain fields with the keyword $ref in field names. For more information, see Elasticsearch Reader.

What do I do if the error message version_conflict_engine_exception is returned when I run a batch synchronization task to synchronize data to an Elasticsearch data source?

  • Cause:

    The optimistic locking mechanism of Elasticsearch is triggered by a version conflict: the current version number of the data in the Elasticsearch data source does not match the version number that is specified in an update command. This can occur when data in the related index is deleted while the synchronization task updates the index.

  • Solutions:

    1. Check whether a data deletion operation is performed.

    2. Change the action type that is used to write data to the Elasticsearch data source from update to index.

What do I do if the error message illegal_argument_exception is returned when I run a batch synchronization task to synchronize data to an Elasticsearch data source?

  • Cause:

    The other_params parameter is not configured for fields for which advanced properties, such as similarity and properties, are configured. As a result, Elasticsearch Writer cannot identify the configured advanced properties.

  • Solution:

    Configure the other_params parameter for the fields and configure the advanced property similarity in the other_params parameter.

    {"name":"dim2_name",...,"other_params":{"similarity":"len_similarity"}}

What do I do if the error message dense_vector is returned when I run a batch synchronization task to synchronize data from fields of an array data type in a MaxCompute data source to an Elasticsearch data source?

  • Cause:

    The fields in the Elasticsearch data source to which you want to write data are of the dense_vector type. You cannot use a batch synchronization task to synchronize data to fields of the dense_vector type in Elasticsearch. Supported data types:

    ID,PARENT,ROUTING,VERSION,STRING,TEXT,KEYWORD,LONG,
    INTEGER,SHORT,BYTE,DOUBLE,FLOAT,DATE,BOOLEAN,BINARY,
    INTEGER_RANGE,FLOAT_RANGE,LONG_RANGE,DOUBLE_RANGE,DATE_RANGE,
    GEO_POINT,GEO_SHAPE,IP,IP_RANGE,COMPLETION,TOKEN_COUNT,OBJECT,NESTED;
  • Solutions:

    • Use self-managed mappings. We recommend that you do not use mappings that are automatically created by Elasticsearch Writer for the related Elasticsearch index.

    • Change the data type of the fields of an array data type to NESTED.

    • Configure the following settings: dynamic=true and cleanup=false. A minimal sketch of these settings follows this list.
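
The following is a minimal sketch of how these two settings appear in the Elasticsearch Writer parameter section; all other writer parameters are omitted.

    "parameter": {
        "dynamic": true,
        "cleanup": false
    }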

What do I do if the error message Could not get block locations is returned when I run a batch synchronization task to synchronize data to an on-premises Hive data source?

  • Cause:

    The value of the mapred.task.timeout parameter is too small. As a result, Hadoop stops the batch synchronization task and clears the data from the temporary directories, and the temporary data can no longer be found.

  • Solution:

    Specify a larger value for the mapred.task.timeout parameter.

What do I do if the error message "no such cmd splitVector" is returned when I run a batch synchronization task to synchronize data from MongoDB?

  • Possible cause:

    When a batch synchronization task is run, the synchronization task preferentially runs the splitVector command to split the synchronization task into slices. However, some MongoDB versions do not support the splitVector command. If you run this command in a synchronization task that is used to synchronize data from a MongoDB database of one of the versions, the no such cmd splitVector error occurs.

  • Solution:

    1. Go to the configuration tab of the batch synchronization task and click the conversion icon in the top toolbar to switch to the code editor.

    2. Add the following setting to the configuration of the batch synchronization task to prevent the splitVector command from being used:

      "useSplitVector" : false

What do I do if the following error message is returned when I run a batch synchronization task to synchronize data to MongoDB: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2"?

  • Problem description:

    When I configure a batch synchronization task by using the codeless UI, I set WriteMode(overwrite or not) to Yes and specify a field other than the _id field as the primary key. When I run the batch synchronization task, this error occurs.

  • Possible cause:

    For some of the data that is written to the destination table, the value of the _id field does not match the primary key that you specify, such as my_id in this example.

  • Solutions:

    • Solution 1: Modify the configurations of the batch synchronization task to ensure that the primary key is the same as the value of the _id field.

    • Solution 2: Set the _id field as the primary key for the batch synchronization task.

When I run my batch synchronization task to synchronize data, the following error message is returned: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit. What do I do?

  • Problem description: When I run my batch synchronization task to synchronize data, an error occurs, and the following error message is returned: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Cause: The amount of data that you want to synchronize is excessively large.

  • Solutions:

    • Increase the number of parallel threads for the batch synchronization task.

    • Reduce the value of the BatchSize parameter.

    • In the parameters for the reader of the batch synchronization task, add the cursorTimeoutInMs parameter and specify a large value for the parameter, such as 3600000 milliseconds.