This topic provides answers to some frequently asked questions about batch synchronization.

Why is the connectivity test of a data source successful, but the corresponding batch synchronization node fails to be run?

  • If the data source previously passed the connectivity test, test the connectivity again to make sure that the resource group that you use can still connect to the data source and that the data source configuration has not changed.
  • Check whether the resource group that is connected to the data source is the same as the resource group that you use to run a batch synchronization node.
    Check the resource that is used to run a node:
    • If the node is run on the shared resource group for Data Integration, the log contains the following information: running in Pipeline[basecommon_ group_xxxxxxxxx].
    • If the node is run on a custom resource group for Data Integration, the log contains the following information: running in Pipeline[basecommon_xxxxxxxxx].
    • If the node is run on an exclusive resource group for Data Integration, the log contains the following information: running in Pipeline[basecommon_S_res_group_xxx].
  • If a node that is scheduled to run in the early morning occasionally fails but succeeds after a rerun, check the load of the data source at the time when the failure occurred.

How do I change the resource group that is used to run a Data Integration node?

  • Method 1: Change the resource group that is used to run a Data Integration node on the Cycle Task page in Operation Center.
  • Method 2: Change the resource group that is used to run a Data Integration node on the Resource Group configuration tab in DataStudio.
    Note If you use method 2, you must commit and deploy the node to make the change take effect.

How do I locate and handle dirty data?

Definition: If an exception occurs when a single data record is written to the destination, the record is considered dirty data. In other words, every data record that fails to be written to the destination is considered dirty data.

Impact: Dirty data fails to be written to the destination. You can control whether dirty data can be generated and the maximum number of dirty data records that can be generated. By default, dirty data is allowed in Data Integration. You can specify the maximum number of dirty data records that can be generated when you configure a synchronization node. For more information, see Configure channel control policies.
  • Dirty data is allowed in a synchronization node: If a dirty data record is generated, the synchronization node continues to run. However, the dirty data record is discarded and is not written to the destination.
  • The maximum number of dirty data records that can be generated is specified in a synchronization node:
    • If you set the maximum number of dirty data records that can be generated to 0, the synchronization node fails and exits as soon as the first dirty data record is generated.
    • If you set the maximum number of dirty data records that can be generated to x, the synchronization node fails and exits when the number of dirty data records exceeds x, and continues to run as long as the number does not exceed x. In both cases, the dirty data records are discarded and are not written to the destination. A script-mode sketch of this setting follows this list.
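
In the code editor (script mode), the dirty data threshold is typically controlled by the errorLimit setting. The following is a minimal sketch under that assumption; the exact value format may vary by DataWorks version:

    "setting": {
        "errorLimit": {
            "record": 0
        }
    }

Setting record to 0 makes the node fail on the first dirty data record. Setting it to a larger value, such as 10, allows up to that many dirty data records before the node fails.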

Analysis of dirty data generated during data synchronization:

  • Problem description: {"message":"Dirty data records are detected when data is written to the destination MaxCompute table: The [third] field contains dirty data. Check and correct the data, or increase the threshold value and ignore this dirty data record.","record":[{"byteSize":0,"index":0,"type":"DATE"},{"byteSize":0,"index":1,"type":"DATE"},{"byteSize":1,"index":2,"rawData":0,"type":"LONG"},{"byteSize":0,"index":3,"type":"STRING"},{"byteSize":1,"index":4,"rawData":0,"type":"LONG"},{"byteSize":0,"index":5,"type":"STRING"},{"byteSize":0,"index":6,"type":"STRING"}]}
  • The logs show that the third field contains dirty data. You can identify the cause of dirty data based on the following scenarios:
    • If dirty data is reported by a writer, check the CREATE TABLE statement that was used to create the destination table. A common cause is that the length defined for the field in the destination MaxCompute table is smaller than the length of the data in the same field in the source MySQL table.
    • If you want to write data from the source to the destination, the following requirements must be met: 1. The data type of a source column must match the data type of the mapped destination column. For example, data of the VARCHAR type in a source column cannot be written to a destination column of the INT type. 2. The size defined by the data type of a destination column must be large enough to hold the data in the mapped source column. For example, data of the LONG, VARCHAR, or DOUBLE type in the source can be written to destination columns of the STRING or TEXT type.
    • If the cause of a dirty data error is not clear, copy and print the dirty data records, inspect the data, and compare the data types of the records with the data types of the destination columns to identify the records that cannot be written.
    Example:
    {"byteSize":28,"index":25,"rawData":"ohOM71vdGKqXOqtmtriUs5QqJsf4","type":"STRING"}
    byteSize: the number of bytes. index:25: the 26th field. rawData: a specific value. type: the data type.

How do I handle a dirty data error that is caused by encoding format configuration issues or garbled characters?

  • Problem description:

    If data contains emoticons, a dirty data error message similar to the following error message may be returned when you synchronize the data: [13350975-0-0-writer] ERROR StdoutPluginCollector - dirty data {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"},}],"type":"writer"} .

  • Cause:
    • utf8mb4 is not configured for a data source. As a result, an error is reported when data that contains emoticons is synchronized.
    • Data in the source contains garbled characters.
    • The encoding format is different between a data source and a synchronization node.
    • The encoding format of the browser is different from the encoding format of the data source or synchronization node. As a result, the preview fails or the previewed data contains garbled characters.
  • Solution:

    The solution varies based on the cause.

    • If data in the source contains garbled characters, process the data before you run a synchronization node.
    • If the encoding format of the data source is different from the encoding format of the synchronization node, modify the configuration for the encoding format of the data source to be the same as the encoding format of the synchronization node.
    • If the encoding format of the browser is different from the encoding format of the data source or synchronization node, modify the configuration for the encoding format of the browser and make sure that the encoding format is the same as the encoding format of the data source and synchronization node. Then, preview the data.
    You can perform the following operations:
    1. If you add a data source by using a Java Database Connectivity (JDBC) URL, set the encoding format to utf8mb4. JDBC URL sample: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.
    2. If you add a data source by using an instance ID, suffix the data source name with the encoding format, such as database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.
    3. Change the encoding format of the data source to utf8mb4. For example, you can change the encoding format of the ApsaraDB RDS data source in the ApsaraDB RDS console.
      Note Run the following command to set the encoding format of the ApsaraDB RDS data source to utf8mb4: set names utf8mb4. Run the following command to view the encoding format of the ApsaraDB RDS data source: show variables like 'char%'.

What do I do if the error message [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots. is returned?

  • Cause:

    The number of nodes that are run in parallel is set to an excessively large value and the resources are not sufficient to run the nodes.

  • Solution:
    Reduce the number of batch synchronization nodes that are run in parallel.
    • If you configure a batch synchronization node by using the codeless user interface (UI), set Expected Maximum Concurrency to a smaller value in the Channel step. For more information, see Configure channel control policies.
    • If you configure a batch synchronization node by using the code editor, set concurrent to a smaller value when you configure the channel control policies, as shown in the sketch after this list. For more information, see Configure channel control policies.
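
The following is a minimal script-mode sketch of the concurrency setting. The value is only an example and should be chosen based on the available resources:

    "setting": {
        "speed": {
            "concurrent": 2
        }
    }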

What do I do if a server-side request forgery (SSRF) attack is detected in a node?

If the data source is added by using a virtual private cloud (VPC) address, you cannot use the shared resource group for Data Integration to run a node. Instead, you can use an exclusive resource group for Data Integration to run the node. You can also change the VPC address to a public network address for the data source before you use the shared resource group for Data Integration to run the node.

What do I do if the error message OutOfMemoryError: Java heap space is returned when I run a batch synchronization node?

Solution:
  1. If you use an exclusive resource group for Data Integration to run a node, you can adjust the values of the Java Virtual Machine (JVM) parameters.
  2. If the reader or writer that you use supports the batchsize or maxfilesize parameter, set the batchsize or maxfilesize parameter to a smaller value, as shown in the sketch after this list.

    If you want to check whether a reader or writer supports the batchsize or maxfilesize parameter, see Supported data sources, readers, and writers.

  3. Reduce the number of nodes that are run in parallel.
    • If you configure a batch synchronization node by using the codeless user interface (UI), set Expected Maximum Concurrency to a smaller value in the Channel step. For more information, see Configure channel control policies.
    • If you configure a batch synchronization node by using the code editor, set concurrent to a smaller value when you configure the channel control policies. For more information, see Configure channel control policies.
  4. If you synchronize files, such as Object Storage Service (OSS) files, reduce the number of files that you want to read.
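
For step 2, the following is a minimal sketch that assumes a writer that supports a batch size setting, such as a relational database writer. Check the documentation of your reader or writer for the exact parameter name and default value:

    "writer": {
        "name": "mysqlwriter",
        "parameter": {
            "batchSize": 256
        }
    }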

What do I do if the same batch synchronization node fails to be run occasionally?

If a batch synchronization node occasionally fails to be run, a possible cause is that the whitelist configuration of the data source for the node is incomplete.
  • Use an exclusive resource group for Data Integration to run the batch synchronization node:
    • If you have added the elastic network interface (ENI) IP address of the exclusive resource group for Data Integration to the whitelist of the data source, you must add the new ENI IP addresses to the whitelist again after the resource group is scaled out.
    • We recommend that you add the CIDR block of the vSwitch to which the exclusive resource group for Data Integration is bound to the whitelist of the data source. Otherwise, you must update the ENI IP addresses each time the resource group is scaled out. For more information, see Configure a whitelist.
  • Use the shared resource group for Data Integration to run the batch synchronization node:

    Make sure that all the CIDR blocks of the machines that are used for data synchronization in the region where the shared resource group for Data Integration resides are added to the whitelist of the data source. For more information, see Add the IP addresses or CIDR blocks of the servers in the region where the DataWorks workspace resides to the whitelist of a data source.

If the configuration of the whitelist for the data source is complete, check whether the connection between the data source and Data Integration is interrupted due to the heavy load of the data source.

What do I do if an error occurs when I add a MongoDB data source as the root user?

Change the username. You must use the name of the user that has operation permissions on the data source instead of the root user.

For example, if you want to synchronize data of the name table in the test data source, use the name of the user that has operation permissions on the test data source.

The authDB database used by MongoDB is the admin database. How do I synchronize data from business databases?

Enter the name of the business database when you configure the data source, and make sure that the user that you use has the required permissions on the business database. If the error message "auth failed" is returned when you test the connectivity of the data source, ignore the error message. If you configure the synchronization node by using the code editor, add the "authDb":"admin" parameter to the JSON configuration of the synchronization node.
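
The following is a minimal sketch of the relevant part of a MongoDB Reader configuration in the code editor. The datasource and collectionName values are placeholders; the key point is the authDb parameter:

    "reader": {
        "name": "mongodbreader",
        "parameter": {
            "datasource": "your_mongodb_datasource",
            "collectionName": "your_collection",
            "authDb": "admin"
        }
    }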

How do I convert the values of the variables in the query parameter into values in the timestamp format when I synchronize incremental data from a table of a MongoDB database?

Use assignment nodes to convert data of the DATE type into data in the TIMESTAMP format and use the timestamp value as an input parameter for data synchronization from MongoDB. For more information, see How do I synchronize incremental data that is in the timestamp format from a table of a MongoDB database?

What do I do if the error message AccessDenied The bucket you access does not belong to you. is returned when I read data from an OSS bucket?

The AccessKey pair that is configured for the OSS data source belongs to a user that does not have permissions to access the bucket. Grant the user permissions to access the bucket.

Is an upper limit configured for the number of OSS objects that can be read?

In Data Integration, the number of OSS objects that OSS Reader can read is not limited. The maximum number of objects that can be read is determined by the JVM parameters that are configured for the synchronization node. To prevent out of memory (OOM) errors, we recommend that you do not use asterisks (*) in the object parameter.

What do I do if the error message Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid! is returned when I write data to Redis in hash mode?

  • Cause:

    If you want to store data in Redis in hash mode, attributes and values must be generated in pairs. Example: MaxCompute Reader is configured with "column":["id","name","age","address"], and Redis Writer is configured with "keyIndexes":[0,1]. In this case, id and name are used as the key, age is used as the attribute, and address is used as the value in Redis. If the source is MaxCompute and only two columns are configured, no attribute-value pair can be formed, the data cannot be stored in Redis in hash mode, and an error is reported.

  • Solution:

    If you want to use only two columns, store the data in Redis in string mode. If you want to store data in hash mode, configure at least three columns in the source, as shown in the following sketch.
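
A minimal sketch of a configuration that works in hash mode, assuming MaxCompute Reader and Redis Writer in the code editor (only the relevant parameters are shown):

    MaxCompute Reader:  "column": ["id", "age", "address"]
    Redis Writer:       "keyIndexes": [0]

With this configuration, id is used as the key, age as the attribute, and address as the value, so attributes and values are generated in pairs.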

What do I do if the following error message is returned when I read data from or write data to MySQL: Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout,' on the server.?

  • Cause:
    • net_read_timeout: If the error message contains this parameter, the execution time of an SQL statement exceeded the maximum execution time that is allowed by ApsaraDB RDS for MySQL. When a synchronization node reads data from a MySQL data source, the data acquisition SQL statement is evenly split into multiple SQL statements based on the splitPk parameter, and one of the split SQL statements timed out.
    • net_write_timeout: If the error message contains this parameter, the timeout period in which the system waits for a block to be written to a data source is too small.
  • Solution:

    Add the net_write_timeout or net_read_timeout parameter to the URL of the ApsaraDB RDS for MySQL database and set the parameter to a larger value. You can also set the net_write_timeout or net_read_timeout parameter to a larger value in the ApsaraDB RDS console.

  • Suggestion:

    If possible, configure the synchronization node to be rerun automatically.

Example: jdbc:mysql://192.168.1.1:3306/lizi?useUnicode=true&characterEncoding=UTF8&net_write_timeout=72000

What do I do if the error message The last packet successfully received from the server was 902,138 milliseconds ago is returned when I read data from MySQL?

In this case, the CPU utilization is normal but the memory usage is high. As a result, the data source is disconnected from Data Integration.

If you confirm that the synchronization node can be rerun automatically, we recommend that you configure the node to be automatically rerun if an error occurs. For more information, see Configure time properties.

What do I do if an error occurs when I read data from PostgreSQL?

  • Problem description: The error message org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery is returned when I use a batch synchronization tool to synchronize data from PostgreSQL.
  • Cause: The system takes a long time to obtain data from the PostgreSQL database, and the query is terminated because it conflicts with the recovery process on the standby server.
  • Solution: Specify larger values for the max_standby_archive_delay and max_standby_streaming_delay parameters of the PostgreSQL database. For more information, see Standby Server Events.

What do I do if the error message Communications link failure is returned?

  • Read data from a data source:
    • Problem description:

      The following error message is returned when data is read from a data source: Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago. The last packet sent successfully to the server was 7,200,100 milliseconds ago. - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure.

    • Cause:

      Slow SQL queries result in timeout when you read data from MySQL.

    • Solution:
      • Check the WHERE clause and make sure that an index is created on the filter column.
      • Check whether a large amount of data exists in the source table. If a large amount of data exists in the source table, we recommend that you run multiple nodes to execute the SQL queries.
      • Check the database logs to find which SQL queries are delayed and contact the database administrator to resolve the issue.
  • Write data to a data source:
    • Problem description:

      The following error message is returned when data is written to a data source: Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago. The last packet sent successfully to the server was 12,013 milliseconds ago. More....

    • Cause:

      A socket timeout occurred due to slow SQL queries. The default value of the SocketTimeout parameter of Taobao Distributed Data Layer (TDDL) connections is 12 seconds. If the execution time of an SQL statement on a MySQL client exceeds 12 seconds, a TDDL-4614 error is returned. This error occasionally occurs when the data volume is large or the server is busy.

    • Solution:
      • We recommend that you rerun the synchronization node after the database becomes stable.
      • Contact the database administrator to adjust the value of the SocketTimeout parameter.

What do I do if a synchronization node fails to be run because the name of a field in the source table is a keyword?

  • Cause: The column parameter contains reserved fields or fields whose names start with a number.
  • Solution: Use the code editor to configure the synchronization node in Data Integration and escape the special fields in the configuration of the column parameter, as shown in the following sketch.
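
A minimal sketch, assuming a MySQL source whose column named table must be escaped with grave accents in the column parameter of the reader (escape characters for other databases are listed in the answer about keyword column names later in this topic):

    "column": ["`table`", "msg"]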

What do I do if the error message Duplicate entry 'xxx' for key 'uk_uk_op' is returned when I run a batch synchronization node?

  • Problem description: Error updating database. Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'cfc68cd0048101467588e97e83ffd7a8-0' for key 'uk_uk_op'
  • Possible cause: In Data Integration, different instances of the same synchronization node cannot be run at the same time. Therefore, multiple instances that are generated from the same JSON configuration cannot run in parallel. For example, for a synchronization node whose instances are scheduled at 5-minute intervals, the instance scheduled at 00:00 and the instance scheduled at 00:05 may both be triggered at 00:05 because the ancestor node of the synchronization node is delayed. As a result, one of the instances fails to be run. In this case, you must backfill data for or rerun the instance that failed.
  • Solution: Stagger the running time of instances. We recommend that you configure nodes that are scheduled to run by hour to depend on their instances in the last cycle. For more information, see Scenario 2: Configure scheduling dependencies for a node that depends on last-cycle instances.

What do I do if the error message plugin xx does not specify column is returned when I run a batch synchronization node?

A possible cause is that the field mapping for the batch synchronization node is incorrect or the column parameter is incorrectly configured in a reader or writer.

  1. Check whether the mapping between the source fields and the destination fields is configured.
  2. Check whether the column parameter is configured in a reader or writer based on your business requirements.

What do I do if the error message The download session is expired. is returned when I read data from a MaxCompute table?

  • Problem description:

    Code:DATAX_R_ODPS_005:Failed to read data from a MaxCompute table, Solution:[Contact the administrator of MaxCompute]. RequestId=202012091137444331f60b08cda1d9, ErrorCode=StatusConflict, ErrorMessage=The download session is expired.

  • Cause:

    If you want to read data from a MaxCompute table, you must run a Tunnel command in MaxCompute to upload and download data. On the server, the lifecycle for each Tunnel session spans 24 hours after the session is created. If a batch synchronization node is run for more than 24 hours, it fails to be run and exits. For more information about the Tunnel service, see Usage notes.

  • Solution:

    Increase the parallelism of the batch synchronization node or reduce the volume of data to be synchronized at a time so that the synchronization can be completed within 24 hours.

What do I do if the error message Error writing request body to server is returned when I write data to a MaxCompute table?

  • Problem description:

    Code:[OdpsWriter-09], Description:[Failed to write data to the destination MaxCompute table.]. - Write data to the destination MaxCompute table block:0 failed, uploadId=[202012081517026537dc0b0160354b]. Contact the administrator of MaxCompute. - java.io.IOException: Error writing request body to server.

  • Cause:
    • Cause 1: The data type is incorrect. The source data does not comply with MaxCompute data type specifications. For example, the value 4.2223 cannot be written to the destination MaxCompute table in the format of DECIMAL(precision,scale), such as DECIMAL(18,10).
    • Cause 2: The MaxCompute block is abnormal or the communication is abnormal.
  • Solution:

    Convert the data type of the data that is to be synchronized to a data type that is supported by the destination. If an error is still reported after you convert the data type, you can submit a ticket for troubleshooting.

What do I do if data fails to be written to DataHub because the amount of data that I want to write to DataHub at a time exceeds the upper limit?

  • Problem description:

    ERROR JobContainer - Exception when job runcom.alibaba.datax.common.exception.DataXException: Code:[DatahubWriter-04], Description:[Failed to write data to DataHub.]. - com.aliyun.datahub.exception.DatahubServiceException: Record count 12498 exceed max limit 10000 (Status Code: 413; Error Code: TooLargePayload; Request ID: 20201201004200a945df0bf8e11a42)

  • Cause:
    The amount of data that you want to write to DataHub at a time exceeds the upper limit that is allowed by DataHub. The following parameters specify the maximum amount of data that can be written to DataHub:
    • maxCommitSize: specifies the maximum amount of buffered data that Data Integration can accumulate before it commits the data to the destination. The default value is 1,048,576 bytes (1 MB).
    • batchSize: specifies the maximum number of buffered data records that a single synchronization task can accumulate before it commits the records to the destination.
  • Solution:

    Set the maxCommitSize and batchSize parameters to smaller values.
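
A minimal sketch of the relevant DataHub Writer parameters in the code editor. The values are only examples that keep batchSize below the 10,000-record limit from the error message; tune them to your data:

    "parameter": {
        "batchSize": 1024,
        "maxCommitSize": 524288
    }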

How do I add prefixes to table names when I synchronize data from multiple tables to the destination at a time?

If you want to add the prefix AAA_ to the name of a source table, use AAA_${db_table_name_sr}.

How do I customize table names in a batch synchronization node?

The tables from which you want to synchronize data are named in a consistent format. For example, the tables are named by date and the table schema is consistent, such as orders_20170310, orders_20170311, and orders_20170312. You can specify custom table names by using the scheduling parameters specified in Create a sync node by using the code editor. This way, the synchronization node automatically reads table data of the previous day from the source every morning.

For example, if the current day is March 15, 2017, the synchronization node can automatically read data of the orders_20170314 table from the source.

In the code editor, use a variable to specify the name of a source table, such as orders_${tablename}. The tables are named by date. If you want the synchronization node to read data of the previous day from the source every day, assign the value ${yyyymmdd} to the ${tablename} variable in the parameter configurations of the synchronization node.
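
A minimal sketch, assuming a MySQL source in the code editor; only the table setting is shown, and the datasource value is a placeholder. The ${tablename} variable is assigned the value ${yyyymmdd} in the parameter configurations of the node:

    "reader": {
        "name": "mysqlreader",
        "parameter": {
            "connection": [
                {
                    "datasource": "your_mysql_datasource",
                    "table": ["orders_${tablename}"]
                }
            ]
        }
    }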

Note

For more FAQ about how to use scheduling parameters, see Configure scheduling parameters.

What do I do if the table that I want to select does not appear in the Table drop-down list of the Source section when I configure a batch synchronization node?

When you configure a batch synchronization node, the Table drop-down list in the Source section displays only the first 25 tables in the selected data source by default. If the selected data source contains more than 25 tables and the table that you want to select does not appear in the Table drop-down list, enter the name of the table in the Table field. You can also configure the batch synchronization node in the code editor.

What are the items that I must take note of when I use the Add Row feature in a synchronization node that reads data from a MaxCompute table?

  1. You can enter constants. Each constant must be enclosed in a pair of single quotation marks ('), such as 'abc' and '123'.
  2. You can use the Add Row feature together with scheduling parameters, such as '${bizdate}'. For more information about how to use scheduling parameters, see Configure scheduling parameters.
  3. You can specify the partition key columns from which you want to read data, such as the partition key column pt.
  4. If the field that you entered cannot be parsed, the value of Type for the field is Custom.
  5. MaxCompute functions are not supported.
  6. If the columns that you manually add are indicated as Custom, such as the partition key columns of MaxCompute tables or LogHub columns, they cannot be previewed, but the synchronization node can still be run.

How do I read data in partition key columns from a MaxCompute table?

Add a data record in the field mapping configuration area, and specify the name of a partition key column, such as pt.

How do I synchronize data from multiple partitions of a MaxCompute table?

Locate the partitions from which you want to read data.
  • You can use Linux Shell wildcards to specify the partitions. An asterisk (*) indicates zero or more characters, and a question mark (?) indicates a single character.
  • The partitions that you specify must exist in the source table. Otherwise, the system reports an error for the synchronization node. If you want the synchronization node to be successfully run even if the partitions that you specify do not exist in the source table, use the code editor to modify the code of the node. In addition, you must add "successOnNoPartition": true to the configuration of MaxCompute Reader.
For example, the partitioned table test contains four partitions: pt=1,ds=hangzhou, pt=1,ds=shanghai, pt=2,ds=hangzhou, and pt=2,ds=beijing. In this case, you can set the partition parameter based on the following instructions:
  • To read data from the partition pt=1,ds=hangzhou, specify "partition":"pt=1,ds=hangzhou".
  • To read data from all the ds partitions in the pt=1 partition, specify "partition":"pt=1,ds=*".
  • To read data from all the partitions in the test table, specify "partition":"pt=*,ds=*".
You can also perform the following operations in the code editor to specify other conditions based on which data is read from partitions.
  • To read data from the partition that stores the largest amount of data, add /*query*/ ds=(select MAX(ds) from DataXODPSReaderPPR) to the configuration of MaxCompute Reader.
  • To filter data based on filter conditions, add /*query*/ pt+Expression to the configuration of MaxCompute Reader. For example, /*query*/ pt>=20170101 and pt<20170110 indicates that you want to read the data that is generated from January 1, 2017 to January 9, 2017 from all the pt partitions in the test table.
Note

MaxCompute Reader processes the content that follows /*query*/ as a WHERE clause.
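
A minimal sketch of the relevant MaxCompute Reader parameters in the code editor, combining the partition filters described above with the successOnNoPartition setting:

    "parameter": {
        "partition": "pt=1,ds=*",
        "successOnNoPartition": true
    }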

What do I do if a synchronization node fails to be run because the name of a column in the source table is a keyword?

  • Problem description:

    A synchronization node fails to be run because the name of a column in the source table is a keyword.

  • Solution:
    Add escape characters to keywords. For more information about how to use the code editor to configure a synchronization node, see Create a sync node by using the code editor. You can add escape characters when you configure the column parameter.
    • MySQL uses grave accents (`) as escape characters to escape keywords in the following format: `Keyword`.
    • Oracle and PostgreSQL use double quotation marks (") as escape characters to escape keywords in the following format: "Keyword".
    • SQL Server uses brackets ([]) as escape characters to escape keywords in the following format: [Keyword].
  • A MySQL data source is used in the following example:
    1. Execute the following statement to create a table named aliyun, which contains a column named table: create table aliyun (`table` int ,msg varchar(10));
    2. Execute the following statement to create a view and assign an alias to the table column: create view v_aliyun as select `table` as col1,msg as col2 from aliyun;
      Note
      • MySQL uses table as a keyword. If the name of a column in the source table is table, an error is reported during data synchronization. In this case, you must create a view to assign an alias to the table column.
      • We recommend that you do not use a keyword as the name of a column.
    3. You can execute the preceding statement to assign an alias to the column whose name is a keyword. When you configure a synchronization node, use the v_aliyun view to replace the aliyun table.

Why is no data obtained when I read data from a LogHub table whose columns contain data?

In LogHub Reader, column names are case-sensitive. Check the column name configuration in LogHub Reader.

Why is some data missing when I read data from a LogHub data source?

In Data Integration, a synchronization node reads data from a LogHub data source based on the time when the data arrives at LogHub. In the LogHub console, check whether the value of the metadata field receive_time of the data is within the time range that is specified for the synchronization node.

What do I do if the fields that I read based on the field mapping configuration in LogHub are not the expected fields?

Manually modify the configuration of the column parameter of the synchronization node in the code editor.

I configured the endDateTime parameter to specify the end time for reading from a Kafka data source, but some data that is returned is generated at a time point later than the specified end time. What do I do?

Kafka Reader reads data from a Kafka data source in batches. If data that is generated later than the time specified by endDateTime is found in a batch of read data, Kafka Reader stops reading data. However, the data generated later than the end time is also written to the destination.
  • You can configure the skipExceedRecord parameter to specify whether to write such data to the destination. For more information, see Kafka Reader. We recommend that you set the skipExceedRecord parameter to false to prevent data loss.
  • You can use the max.poll.records parameter in Kafka to specify the number of records that are pulled in a single poll. This parameter and the parallelism of the synchronization node determine the maximum amount of excess data that can be written, as shown in the following formula: Allowed excess data volume < max.poll.records × Parallelism of the synchronization node.
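
A minimal sketch of the relevant Kafka Reader parameters in the code editor. The skipExceedRecord parameter is described above; the kafkaConfig block is assumed here to be where consumer properties such as max.poll.records are passed through, so verify it against the Kafka Reader documentation:

    "parameter": {
        "skipExceedRecord": false,
        "kafkaConfig": {
            "max.poll.records": "500"
        }
    }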

How do I remove the random strings that appear after I write data to OSS?

When you use OSS Writer to write files to OSS, take note of the name prefixes of the files. OSS simulates a directory structure by adding prefixes and delimiters to object names. For example, if you set "object": "datax", the names of the generated files start with datax and end with random strings. The number of files that are generated is determined by the number of tasks into which the synchronization node is split.

If you do not want to use a random universally unique identifier (UUID) as the suffix, we recommend that you set the writeSingleObject parameter to true. For more information, see the description of the writeSingleObject parameter in OSS Writer.
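
A minimal sketch of the relevant OSS Writer parameters in the code editor; the object value is a placeholder:

    "writer": {
        "name": "osswriter",
        "parameter": {
            "object": "datax/result.csv",
            "writeSingleObject": true
        }
    }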

How does the system synchronize data from a MySQL data source on which sharding is performed to a MaxCompute table?

For more information about how to configure MySQL Reader to read data from a MySQL data source, see MySQL Reader.