Import and export data through Data Integration

Last Updated: Aug 17, 2020

This topic describes how to import data to or export data from Distributed Relational Database Service (DRDS) databases through Data Integration.

Data Integration is a data synchronization platform provided by Alibaba Group. It supports data synchronization between heterogeneous data storage systems and offers offline (full and incremental) data access channels in diverse network environments for more than 20 types of data sources. For a detailed list of data source types, see Supported data source types.

You can use Data Integration to complete the following data synchronization tasks on DRDS:

  • Synchronize data from a DRDS database to other data sources, and process the data accordingly.
  • Synchronize the processed data from other data sources to the DRDS database.

Process overview

The data synchronization process contains the following steps:

  • Step 1: Create a table in the source data store.
  • Step 2: Add a data source.
  • Step 3: Configure a synchronization task in wizard mode or script mode.
  • Step 4: Run the synchronization task and check the data quality at the target end.

Preparations

Before you use Data Integration to import data to or export data from a DRDS database, complete the following preparations:

  1. Sign up for an Alibaba Cloud account, complete real-name verification on the Alibaba Cloud website, and create an AccessKey pair for the account.
  2. Activate MaxCompute (formerly known as ODPS). A default ODPS data source is automatically generated. Use the Alibaba Cloud account to log on to DataWorks.
  3. Create a workspace. You must create a workspace before you can use DataWorks. Within the workspace, you can collaborate to complete workflows and maintain data and jobs.
  4. If you want to create Data Integration nodes by using a Resource Access Management (RAM) user, you must grant corresponding permissions to the RAM user.

Add a data source

The following example describes how to add a DRDS data source.

Note: Only the workspace administrator can create a data source. Other roles can only view data sources.

  1. Log on to the DataWorks console as the workspace administrator.

  2. On the Workspaces page, find the target workspace and click Enter the Workspace in the Actions column.

  3. Click Data Integration in the upper-right corner. Then, click Data Sources in the left-side navigation pane.

  4. Click Add Data Source in the upper-right corner.

  5. In the Add Data Source dialog box, set parameters as required.
    The configuration items of the DRDS data source are described as follows:

    • Data Source Name: the name of the data source, which can contain letters, digits, and underscores (_). It must start with a letter or an underscore (_) and cannot exceed 60 characters in length.
    • Description: the description of the data source, which cannot exceed 80 characters in length.
    • Data Source Type: DRDS is selected in this example.
    • Network Type: the selected network type.
    • JDBCUrl: the JDBC URL, in the format jdbc:mysql://serverIP:Port/database (see the example after this procedure).
    • Username and Password: the username and password used to connect to the database.
  6. After you complete the configurations, click Test Connectivity.

  7. After the data source passes the connectivity test, click Complete.
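
For reference, a filled-in JDBC URL in the jdbc:mysql://serverIP:Port/database format might look like the following. The IP address, port, and database name are placeholders; replace them with the values of your own DRDS instance.

    jdbc:mysql://10.20.30.40:3306/drds_demo_db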

Import data through Data Integration

The following content describes how to synchronize data from MaxCompute to a DRDS database in the wizard mode of Data Integration.

  1. On the Data Integration page, create a synchronization task.

    • Wizard mode: The wizard mode provides visualized interfaces to configure the synchronization task in five steps: select the source, select the target, map the fields, configure the channel, and preview and save the configurations. For different data sources, the interfaces for these steps may be different. You can switch from the wizard mode to the script mode.
    • Script mode: In script mode, you can select a template that contains the main parameters of the synchronization task and then create the task by setting the remaining parameters. You cannot switch from script mode to wizard mode.
  2. Select the source table. Select MaxCompute and the source table mytest. The data browsing area is collapsed by default. Click Next.

  3. Select the target table. Select DRDS and the target table contact_infos. Click Next.

    • preSql: the SQL statement that is executed before the data synchronization task runs, for example, a statement that clears old data. Currently, you can execute only one SQL statement in wizard mode and multiple SQL statements in script mode.
    • postSql: the SQL statement that is executed after the data synchronization task finishes, for example, a statement that adds a timestamp. Currently, you can execute only one SQL statement in wizard mode and multiple SQL statements in script mode. For a filled-in example of both parameters, see the sketch after this procedure.
  4. Set field mappings. The mapping between Source Table Fields on the left and Target Table Fields on the right is one-to-one.

  5. Click Next. On the Channel Control page, configure the maximum job rate and dirty data check rules (a sketch of the corresponding speed settings follows the script-mode configuration below).

    • Maximum Rate: the maximum data transmission rate. The actual rate may be affected by the network environment and database configuration.
    • Concurrent Job Count: Maximum job rate = Concurrent job count × Individual job transmission rate.


    When the maximum job rate is specified, set the concurrent job count according to the following principles:

    • Do not set too many concurrent connections if you use an online database as the data source. Too many concurrent connections can impact the performance of the database.
    • If you require a high data synchronization rate, we recommend that you choose the highest job rate and a large concurrent job count.
  6. After you complete the preceding configuration, scroll up or down to view the job configuration. Verify that the configuration is correct and click Save.

  7. Click Run Task to run the synchronization task immediately. Alternatively, you can submit the task to the scheduling system, which automatically and periodically runs it starting from the next day according to the configured scheduling properties.
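
As a reference for the preSql and postSql parameters described in step 3, the following sketch shows how the two statements might be filled in for the contact_infos table, written in script-mode syntax. The statements and the gmt_modified column are illustrative assumptions; adapt them to your own schema.

    "writer": {
      "plugin": "drds",
      "parameter": {
        "datasource": "l_Drds_w",
        "table": "contact_infos",
        "preSql": ["DELETE FROM contact_infos"], // Example: clear old data before the import.
        "postSql": ["UPDATE contact_infos SET gmt_modified = NOW()"], // Example: add a timestamp after the import (gmt_modified is a hypothetical column).
        "column": ["id"]
      }
    }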

Configure a synchronization task in script mode

The following sample code shows a script-mode configuration for importing data from MaxCompute to a DRDS database:

    {
      "type": "job",
      "version": "1.0",
      "configuration": {
        "reader": {
          "plugin": "odps",
          "parameter": {
            "datasource": "lzz_odps", // The data source name. We recommend that you add the data source before synchronization.
            "table": "mytest", // The name of the source table.
            "partition": "", // The partition information.
            "column": [
              "id"
            ]
          }
        },
        "writer": {
          "plugin": "drds",
          "parameter": {
            "datasource": "l_Drds_w", // The data source name. We recommend that you add the data source before synchronization.
            "table": "contact_infos", // The name of the target table.
            "preSql": [], // The SQL statement that is executed before the import.
            "postSql": [], // The SQL statement that is executed after the import.
            "column": [
              "id"
            ]
          }
        },
        "setting": {
          "speed": {
            "mbps": "1", // The maximum rate of a concurrent job is 1 MB/s.
            "concurrent": "1" // The number of concurrent jobs.
          }
        }
      }
    }
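
Based on the formula in the channel control step (maximum job rate = concurrent job count × individual job transmission rate), the following sketch shows a setting block that would cap the job at roughly 3 MB/s. The values are illustrative only; keep the concurrency low if the source is an online production database.

    "setting": {
      "speed": {
        "mbps": "1", // Each concurrent job transmits at up to 1 MB/s.
        "concurrent": "3" // With 3 concurrent jobs, the maximum job rate is about 3 MB/s.
      }
    }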

Export data through Data Integration

The following content describes how to synchronize data from a DRDS database to MaxCompute in wizard mode.

  1. On the Data Integration page, create a synchronization task.

  2. Select the source table. Select DRDS and the source table bit_type_test. The data browsing area is collapsed by default. Click Next.

    • Filtering condition: DrdsReader concatenates an SQL statement based on the specified table, columns, and WHERE condition, and uses the generated SQL statement to filter and read data. You can set the WHERE condition according to your actual scenario. For example, to synchronize only the data of the current day, you can set the WHERE condition to STR_TO_DATE('${bdp.system.bizdate}', '%Y%m%d') <= today AND today < DATE_ADD(STR_TO_DATE('${bdp.system.bizdate}', '%Y%m%d'), INTERVAL 1 DAY). For a filled-in example, see the sketch after this procedure.
    • Shard key: You can use a column in the source table as the shard key to enable concurrent data synchronization after sharding. Currently, only integer fields are supported. We recommend that you use a primary key or an indexed column as the shard key.
  3. Select MaxCompute and the target table mytest, then click Next.

  4. Click Next and then select the mapping between fields. The mapping between Source Table Fields on the left and Target Table Fields on the right is one-to-one.
    You can click Add a Row to add a mapping.

    • You can enter constants. Each constant must be enclosed in single quotation marks, for example, 'abc' and '123'.
    • You can add scheduling parameters, such as '${bdp.system.bizdate}'.
    • You can enter functions supported by relational databases, such as now() and count(1).
    • If the value you entered cannot be parsed, the type is displayed as '-'. For how constants, scheduling parameters, and functions can be written in script mode, see the sketch after the script-mode configuration at the end of this topic.
  5. Click Next. On the Channel Control page, configure the maximum job rate and dirty data check rules.

    • Maximum Rate: the maximum data transmission rate. The actual rate may be affected by the network environment and database configuration.
    • Concurrent Job Count: Maximum job rate = Concurrent job count × Individual job transmission rate.

    If the maximum job rate is specified, set the concurrent job count according to the following principles:

    • Do not set too many concurrent connections if you use an online database as the data source. Too many concurrent connections can impact the performance of the database.
    • If you require a high data synchronization rate, we recommend that you choose the highest job rate and a large concurrent job count.
  6. After you complete the preceding configuration, scroll up or down to view the task configuration. Verify that the configuration is correct and click Save.

  7. Click Run Task to run the synchronization task immediately. Alternatively, you can submit the task to the scheduling system, which automatically and periodically runs it starting from the next day according to the configured scheduling properties.
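
As a reference for the filtering condition and shard key described in step 2, the following sketch shows how the where and splitPk parameters might be filled in, written in script-mode syntax. The today column is assumed to be a date column in the source table; adjust the condition to your own schema.

    "reader": {
      "plugin": "drds",
      "parameter": {
        "datasource": "l_Drds_w",
        "table": "bit_type_test",
        "where": "STR_TO_DATE('${bdp.system.bizdate}', '%Y%m%d') <= today AND today < DATE_ADD(STR_TO_DATE('${bdp.system.bizdate}', '%Y%m%d'), INTERVAL 1 DAY)", // Synchronize only the data of the current day; today is an assumed date column.
        "splitPk": "col2", // The shard key: an integer column, preferably a primary key or an indexed column.
        "column": ["idbit"]
      }
    }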

Configure a synchronization task in script mode

The following sample code shows a script-mode configuration for exporting data from a DRDS database to MaxCompute:

    {
      "type": "job",
      "version": "1.0",
      "configuration": {
        "reader": {
          "plugin": "drds",
          "parameter": {
            "datasource": "l_Drds_w", // The data source name. We recommend that you add the data source before synchronization.
            "table": "bit_type_test", // The name of the source table.
            "where": "", // The filtering condition.
            "splitPk": "col2", // The shard key.
            "column": [
              "idbit"
            ]
          }
        },
        "writer": {
          "plugin": "odps",
          "parameter": {
            "datasource": "lzz_odps", // The data source name. We recommend that you add the data source before synchronization.
            "table": "mytest", // The name of the target table.
            "truncate": true, // Specifies whether to clear the existing data in the destination table or partition before writing.
            "partition": "", // The partition information.
            "column": [
              "id"
            ]
          }
        },
        "setting": {
          "speed": {
            "mbps": "1", // The maximum transmission rate of a concurrent job is 1 MB/s.
            "concurrent": "1" // The number of concurrent jobs.
          },
          "errorLimit": {
            "record": "234" // The maximum number of error records allowed.
          }
        }
      }
    }
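
As a reference for the field mapping options in step 4, the following sketch shows how a constant, a scheduling parameter, and a database function might be added to the reader's column list in script mode. The quoting syntax shown here is an assumption and may differ in your DataWorks version; verify it against the field mapping page in wizard mode.

    "column": [
      "idbit",
      "'abc'", // A constant enclosed in single quotation marks.
      "'${bdp.system.bizdate}'", // A scheduling parameter.
      "now()" // A function supported by the relational database.
    ]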