This topic provides answers to frequently asked questions about Data Integration.

What connections does Data Integration support?

For more information about connections supported by Data Integration, see Supported data stores and plug-ins. The topic provides links to other topics where you can learn how to create a connection and how to configure a reader or writer to read or write data.

How do I troubleshoot Data Integration issues?

If an error occurs during an operation in Data Integration, locate the fault first. You can check information related to the error, such as the resource group where nodes are run, the connections, and the region where node instances reside.
  • Check the resource group where nodes are run.
    • If nodes are run on the default resource group, the following information appears in logs:
      running in Pipeline[basecommon_ group_xxxxxxxxx]
    • If nodes are run on a custom resource group of Data Integration, the following information appears in logs:
      running in Pipeline[basecommon_xxxxxxxxx]
    • If nodes are run on an exclusive resource group for Data Integration, the following information appears in logs:
      running in Pipeline[basecommon_S_res_group_xxx]
  • Check connection information.
    You need to check the following configurations of connections:
    1. Check the names and types of the connections to the source and destination data stores.
    2. Check the environment in which the connections are used.

      Examples: ApsaraDB connections, connections in connection string mode to which Data Integration can directly connect, connections in connection string mode to which Data Integration cannot directly connect, Relational Database Service (RDS) or other connections in a Virtual Private Cloud (VPC), and connections in Finance Cloud (VPC and classic network).

    3. Check whether each connection has passed the connectivity test.
      Check whether the configurations of connections are correct. For more information, see Supported data stores. Some examples of invalid configurations are as follows:
      • One or more database names are incorrect.
      • The entered information contains space characters or special characters.
      • Connectivity testing is not supported for some connections, such as non-RDS connections in VPCs.
  • Check the region where node instances reside.
    Log on to the DataWorks console to view the relevant region, such as China (Shanghai), China (Shenzhen), China (Hong Kong), Singapore, Germany (Frankfurt), and Australia (Sydney). By default, the China (Shanghai) region is selected.
    Note You can view the region only after you purchase the MaxCompute service.
  • Copy the error code that appears on the page.

    If an error occurs, copy the error code to a ticket and submit the ticket for consultation.

How do I configure a Data Integration node?

You can configure a Data Integration node on the codeless user interface (UI) or in the code editor. For more information, see Create a sync node by using the codeless UI and Create a sync node by using the code editor.

What can I do if a sync node is waiting for slot resources?

Symptom: A sync node is not functioning properly, and no logs are recorded for the node instances. The status of the node is Pending (Resources).

This issue occurs because the sync node is configured to run on a custom resource group, but the custom resource group has no available resources. Follow these steps to address this issue:
  1. Log on to the DataWorks console.
  2. In the left-side navigation pane, click Workspaces.
  3. On the Workspaces page that appears, find the target workspace and click Data Analytics in the Actions column.
  4. On the DataStudio page that appears, click the icon in the upper-left corner and choose All Products > Operation Center.
  5. On the Operation Center page that appears, choose Cycle Task Maintenance > Cycle Task in the left-side navigation pane. On the Cycle Task page, right-click the node that is not scheduled as expected in the directed acyclic graph (DAG) and select View Node Details to view the resource group used by the node.
  6. Go back to the homepage of the DataWorks console and click Resource Groups in the left-side navigation pane. On the Resource Groups page, click the Custom Resource Groups tab.

    Find the target resource group and click Manage Server in the Actions column. Check whether the ECS instance is stopped or occupied by other nodes. If the ECS instance is stopped, start it. If the ECS instance is occupied by other nodes, wait until those nodes are complete.

If the issue persists, restart the service by running the following commands:
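# Switch to the admin account.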
su - admin
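# Restart the alisatasknode service.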
/home/admin/alisatasknode/target/alisatasknode/bin/serverctl restart

How do I handle issues related to the encoding format?

Symptom: After you set an encoding format for a sync node, if the data to be synchronized contains emoticons, the sync node may fail and generate dirty data, or the sync node may succeed but the synchronized data contains garbled characters.
  • A sync node fails with dirty data generated.
    Symptom: A sync node fails and dirty data is generated due to the encoding format. The following logs are generated:
    2016-11-18 14:50:50.766 [13350975-0-0-writer] ERROR StdoutPluginCollector - Dirty data
    {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"},
    {"byteSize":33,"index":1,"rawData":"hello[emoticon]cry","type":"STRING"},
    {"byteSize":8,"index":4,"rawData":0,"type":"LONG"}],"type":"writer"}
    2016-11-18 14:50:51.265 [13350975-0-0-writer] WARN  CommonRdbmsWriter$Task - Roll back the synchronization, and enable the writer to write only one row each time: java.sql.BatchUpdateException: Incorrect string value: '\xF0\x9F\x88\xB6\xEF\xB8...' for column 'introduction' at row 1

    Cause: The encoding format of the database or connection is not set to utf8mb4. Only the utf8mb4 encoding format supports emoticons.

    You can use one of the following methods to address this issue. A sketch for checking and setting the encoding format is provided at the end of this section.
    • When you create a connection by using a Java Database Connectivity (JDBC) URL, set the encoding format to utf8mb4. An example setting is jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.
    • When you create a connection by using an instance ID, suffix the database name with the encoding format information. An example setting is database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.
    • Change the encoding format of the connection to utf8mb4. For example, you can change the encoding format of an ApsaraDB for RDS instance in the ApsaraDB for RDS console.
      Note If you want to use set names utf8mb4 to set the encoding format of an ApsaraDB for RDS connection to utf8mb4, you must create the connection in the connection string mode without a public IP address.
  • A sync node is successful but the synchronized data contains garbled characters.
    The possible causes for this issue are as follows:
    • The source data contains garbled characters.
    • The specified encoding format is different between the reader and the writer.
    • The encoding format of the browser is different from that of the reader or writer, and therefore the preview fails or the previewed data contains garbled characters.
    Address this issue as follows:
    • If the source data contains garbled characters, process the source data before running a sync node.
    • If the encoding format of the reader is different from that of the writer, correct the settings for the encoding format of the reader to be the same as that of the writer.
    • If the encoding format of the browser is different from that of the reader or writer, correct the settings for the encoding format of the browser to be the same as that of the reader or writer and then preview data.
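As noted earlier, the encoding format of a MySQL database and connection can be checked and set with standard MySQL statements. The following is a minimal sketch for reference:
-- Check the character sets that the server and the current connection use.
show variables like 'character_set%';
-- Switch the current session to utf8mb4 so that emoticons can be written.
set names utf8mb4;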

What can I do if ApsaraDB for RDS data fails to be synchronized?

Symptom: When data is synchronized from an ApsaraDB RDS for MySQL, ApsaraDB RDS for SQL Server, or ApsaraDB RDS for PostgreSQL instance to a user-created MySQL, SQL Server, or PostgreSQL database, an error message appears, indicating that Data Integration cannot connect to the corresponding database.

Assume that the issue occurs when data is synchronized from an ApsaraDB RDS for MySQL instance to a user-created SQL Server database. Follow these steps to address the issue:
  1. Add an ApsaraDB RDS for MySQL connection by using a JDBC URL.
    Note
    • If the user-created SQL Server database can access the Internet, set the JDBC URL to the public endpoint of the ApsaraDB RDS for MySQL instance. In this case, you can use the default resource group to run sync nodes.
    • If the user-created SQL Server database cannot access the Internet, activate DataWorks Professional Edition or a more advanced edition. In this case, you can use a custom resource group for Data Integration to run sync nodes.
    • If you have used Express Connect to connect your on-premises data center to the VPC where the ApsaraDB RDS for MySQL instance resides, you can use an exclusive resource group for Data Integration to run sync nodes. The VPC of the exclusive resource group for Data Integration is the same as the VPC to which your on-premises data center is connected.
  2. Use the new connection to configure and run a sync node.
    Note For data synchronization between ApsaraDB for RDS instances, such as from an ApsaraDB RDS for MySQL instance to an ApsaraDB RDS for SQL Server instance, we recommend that you create connections by using instance IDs.

What can I do if a sync node fails because the name of a column in the source table is a keyword?

Symptom: A sync node fails because the name of a column in the table to be synchronized is a keyword.

Take a MySQL connection as an example. To resolve this issue, follow these steps:
  1. Execute the following statement to create a table named aliyun, which contains a column named table:
    create table aliyun (`table` int ,msg varchar(10));
  2. Execute the following statement to create a view and assign an alias to the table column:
    create view v_aliyun as select `table` as col1, msg as col2 from aliyun;
    Note
    • MySQL uses table as a keyword. If the data synchronization code contains table as the name of a column, an error is returned. In this case, you need to assign an alias to the table column by creating a view.
    • We recommend that you do not use a keyword as the name of a column.
    You can execute the preceding statement to assign an alias to the column whose name is a keyword. When you configure a sync node, use the v_aliyun view to replace the aliyun table.
Note
  • MySQL uses grave accents (`) as escape characters to escape keywords in the following format: `Keyword`.
  • Oracle and PostgreSQL use double quotation marks (") as escape characters to escape keywords in the following format: "Keyword".
  • SQL Server uses brackets ([ ]) as escape characters to escape keywords in the following format: [Keyword].
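For reference, the following illustrative queries apply each escape format to the aliyun table from the preceding example:
-- MySQL: escape the keyword with grave accents.
select `table` from aliyun;
-- Oracle and PostgreSQL: escape the keyword with double quotation marks.
select "table" from aliyun;
-- SQL Server: escape the keyword with brackets.
select [table] from aliyun;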

How do I customize table names in a sync node?

Background: One table is created each day, with the data timestamp indicated in the table name, for example, orders_20170310, orders_20170311, and orders_20170312. These tables follow the same schema.

You can create a sync node to import data of these tables to MaxCompute. When you configure the sync node, you can customize the source table name so that the sync node can automatically collect the table data of the previous day from the source database every early morning. For example, the sync node can automatically collect data of the orders_20170314 table from the source database on March 15, 2017.
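For reference, the following is a minimal sketch of the schema that such daily tables might share. The column names are hypothetical and only illustrate that the tables are identical in structure:
-- Hypothetical schema shared by orders_20170310, orders_20170311, and so on.
create table orders_20170310 (order_id bigint, amount decimal(10,2), created_at datetime);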

To customize the source table name, follow these steps:
  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Analytics in the Actions column.
  2. On the DataStudio page that appears, create a sync node on the codeless UI and select a source table, for example, orders_20170310. For more information, see Create a batch synchronization node.
  3. Click Switch to Code Editor to switch from the codeless UI to the code editor.
  4. In the code editor, use a variable as the name of the source table, for example, orders_${tablename}.

    Assign the value $[yyyymmdd-1] to the ${tablename} variable in the parameter settings of the sync node, because table names are distinguished by day and you want the sync node to collect the table data of the previous day on each day.

    Note You can also directly use the orders_${bdp.system.bizdate} variable as the name of the source table. In this case, you do not need to assign a value.
  5. After you configure the sync node, click the Save and Submit icons in sequence.

How do I handle the error returned when I use the username root to create a MongoDB connection?

Change the username. To create a MongoDB connection, you must use a username for connecting to the database that contains the table to be synchronized, instead of the username root.

For example, if you want to import data of the name table in the test database, use a username for connecting to the test database.

I have added information about the resource group that needs to access an ApsaraDB for RDS data store to the whitelist of the data store. Why does the error message "user not exist ip white list reference" appear when I create a connection to the data store?

This issue is usually caused by an incorrect username. Check whether the username that you entered is correct.

Why am I unable to perform big data computing on a resource group for scheduling?

Resource groups for scheduling are used to schedule nodes and contain limited resources. Therefore, they are inapplicable to big data computing. MaxCompute can process a large amount of data. We recommend that you use MaxCompute for big data computing.

When I create a connection to a data store by using a JDBC URL, the data store fails the connectivity test. What can I do in this scenario?

Check the network environment of the database.
  • If the database can access the Internet, set the JDBC URL to the public endpoint of the database.

    Before you use a database as a data store, you must add the IP addresses or Classless Inter-Domain Routing (CIDR) blocks that you use to access the database to the whitelist of the database. This improves security and stability of the database. For more information, see Configure a whitelist.

  • If the database can be accessed only over a private network, add a custom resource group in the DataWorks console to access the database.

When I create a connection to a user-created on-premises database, the data store fails the connectivity test. What can I do in this scenario?

Check whether the database can access the Internet.
  • If the database can access the Internet, add information about the resource group that needs to access the database to the whitelist of the database based on the region of the target workspace. For more information, see Configure a whitelist.
  • If the database cannot access the Internet, add a custom resource group to access the database. For more information, see Add a custom resource group.

What can I do if an error is reported when DataWorks synchronizes PostgreSQL data?

Symptom: The following error is reported when DataWorks synchronizes PostgreSQL data:
org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

This error occurs because it takes a long time to obtain data from the PostgreSQL database. To resolve this issue, specify the max_standby_archive_delay and max_standby_streaming_delay parameters in the code of the sync node. For more information, see Standby Server Events.
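Both parameters are standard PostgreSQL standby-server settings. For reference, a database administrator can also raise them on the PostgreSQL server itself. The following is a sketch with illustrative values, not values prescribed by this topic:
-- Allow queries on the standby server to delay WAL replay for longer before they are canceled.
ALTER SYSTEM SET max_standby_archive_delay = '600s';
ALTER SYSTEM SET max_standby_streaming_delay = '600s';
-- Reload the configuration so that the new values take effect.
SELECT pg_reload_conf();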

What can I do if an error is reported when DataWorks synchronizes MaxCompute data to DRDS?

Symptom: The following error is reported when DataWorks synchronizes MaxCompute data to Distributed Relational Database Service (DRDS):
docs not support duplicate values in insert

This error occurs because DRDS does not support global secondary indexes or duplicate values.

What can I do if an error is reported when I connect Power BI to MaxCompute?

Currently, Power BI cannot connect to MaxCompute. We recommend that you connect Hologres, instead of Power BI, to MaxCompute.

When I configure a batch sync node, the desired table does not appear in the Table drop-down list in the Connections section. How do I find the desired table in this scenario?

When you configure a batch sync node, the Table drop-down list in the Connections section displays only the first 25 tables in the selected data store by default. If the selected data store contains more than 25 tables and your desired table does not appear in the Table drop-down list, enter the name of the desired table in the Table field. Alternatively, configure the batch sync node in the code editor.

When I create a connection to a user-created MongoDB data store in a VPC, the data store fails the connectivity test. What can I do in this scenario?

Symptom: When you create a connection to a user-created MongoDB data store in a VPC, the data store fails the connectivity test. You can use one of the following methods to address this issue:
  • Method 1: Enable the data store to access the Internet.
    1. Add a connection to the user-created MongoDB data store and set the Connect To parameter to Connection Mode.
    2. Enable the MongoDB data store to access the Internet.
    3. Add information about the corresponding resource group to the whitelist of the MongoDB data store. For more information, see Configure a whitelist.
    4. Test the connectivity of the MongoDB data store. For more information, see Test data store connectivity.
  • Method 2: Configure a custom resource group to access the user-created MongoDB data store over the private network.
    Notice You can only use this method in DataWorks Professional Edition or a more advanced edition.
    1. Prepare an ECS instance that is in the same region and on the same private network as the MongoDB data store. For more information, see Add a custom resource group.
    2. Add the IP address of the ECS instance to the whitelist or security group of the MongoDB data store.
    3. Save the connection settings without testing the connectivity. MongoDB data stores do not support connectivity tests currently.
    4. Use the custom resource group to run the sync node for synchronizing data from the MongoDB data store and test the sync node.

What can I do if the data synchronized by a DataWorks sync node with multiple threads is out of order?

A sync node reads data from MaxCompute tables in a random order. If you do not customize order settings, the data returned by the sync node is also out of order.

By default, the data synchronized from MaxCompute is stored in a random order. If you want to obtain sorted data, specify a parameter to sort synchronized data. For example, you can specify order by xx limit n in the SQL statement of the sync node to sort data.
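For example, the following statement shows this pattern with a hypothetical table and columns. Note that MaxCompute typically requires limit to be used together with order by:
-- Hypothetical table and columns: return the synchronized rows sorted by id.
select id, name from my_table order by id limit 1000;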

What can I do if MaxCompute data fails to be synchronized to AnalyticDB for MySQL?

Add the account for connecting to AnalyticDB for MySQL as a member of the corresponding DataWorks workspace.

What can I do if a sync node that is run on the default resource group fails to read data from an AnalyticDB for MySQL data store?

Purchase an exclusive resource group, bind it to your VPC, and then run the sync node on the exclusive resource group to read data from the AnalyticDB for MySQL data store. For more information, see Test data store connectivity.

Can I run a real-time sync node on the node configuration tab?

No, you cannot run a real-time sync node on the node configuration tab. Instead, you must run a real-time sync node in the production environment after committing and deploying the node. For more information, see Create, commit, and manage real-time sync nodes.

What can I do if an exclusive resource group that resides in a VPC and is used to run a real-time sync node cannot connect to the target data store?

  1. Verify that the exclusive resource group and data store are in the same VPC and zone and are bound to the same VSwitch.
  2. Add the Elastic IP Address (EIP) and VPC or VSwitch CIDR block of the exclusive resource group to the whitelist or security group of the data store. For more information, see Exclusive resource groups.

What can I do if an error is reported when a sync node attempts to read data from a MySQL data store?

DataWorks Data Integration cannot run sync nodes on the default resource group or an exclusive resource group to read data from a MySQL 8.0 data store. If such a sync node is run, the following error is reported:
Unknown system variable 'query_cache_size' - java.sql.SQLException: Unknown system variable 'query_cache_size'

If the data store is a MySQL 8.0 data store, create a custom resource group to run the sync node. For more information, see Add a custom resource group.
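For background, the query cache was removed in MySQL 8.0, so the query_cache_size system variable no longer exists, and drivers that read this variable during connection setup fail with the preceding error. You can confirm this with a standard MySQL statement:
-- Returns the query cache size on MySQL 5.7 and earlier.
-- Returns an empty result on MySQL 8.0 because the variable was removed.
show variables like 'query_cache_size';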