This topic describes how to use Data Integration to import offline data to DataHub. In this example, a sync node is configured in the code editor to synchronize data from Stream to DataHub.

Prerequisites

  1. An Alibaba Cloud account and its AccessKey pair are created.
  2. MaxCompute is activated, and a default MaxCompute connection is automatically created. The Alibaba Cloud account is used to log on to DataWorks.
  3. A workspace is created so that you can collaboratively develop workflows and maintain data and nodes in the workspace. For more information, see Create a workspace.
    Note If you want to create a data integration node as a RAM user, grant the required permissions to the RAM user. For more information, see Prepare a RAM user and Manage workspace members.

Background information

Data Integration is a reliable, secure, cost-efficient, and scalable data synchronization platform provided by Alibaba Cloud. It can synchronize data between heterogeneous data storage systems and provides offline data access channels in diverse network environments for more than 20 types of connections.

In this example, a DataHub connection is configured. For information about how to use other types of connections to configure sync nodes, see Reader configuration and Writer configuration.

Procedure

  1. Go to the Data Integration page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. After you select the region where the required workspace resides, find the workspace and click Data Integration.
  2. On the Data Integration homepage, click New synchronization task to go to the DataStudio page.
  3. In the Create Node dialog box, set the Node Name and Location parameters and click Commit.
    Note
    • The node name must be 1 to 128 characters in length.
    • Set the Location parameter to the directory where the created workflow resides. For more information, see Create a workflow.
  4. After the batch sync node is created, click the Switch to Code Editor icon in the top toolbar.
  5. In the Confirm message, click OK. The code editor appears.
  6. Click the Apply Template icon in the top toolbar.
  7. In the Apply Template dialog box, set the Source Connection Type parameter to Stream and the Target Connection Type parameter to DataHub, and click OK.
  8. After the template is applied, edit the code as required. For a quick local check of the edited configuration, see the first sketch after this procedure.
    {
      "type": "job",
      "version": "1.0",
      "configuration": {
        "setting": {
          "errorLimit": {
            "record": "0" // The maximum number of dirty data records allowed.
          },
          "speed": {
            "mbps": "1", // The maximum transmission rate.
            "concurrent": 1, // The number of concurrent threads.
            "throttle": false // Specifies whether to enable throttling. The mbps setting takes effect only if throttle is set to true.
          }
        },
        "reader": {
          "plugin": "stream",
          "parameter": {
            "column": [ // The columns in the source table.
              {
                "value": "field", // The value of the column.
                "type": "string"
              },
              {
                "value": true,
                "type": "bool"
              },
              {
                "value": "byte string",
                "type": "bytes"
              }
            ],
            "sliceRecordCount": "100000" // The number of records that Stream Reader generates.
          }
        },
        "writer": {
          "plugin": "datahub",
          "parameter": {
            "datasource": "datahub", // The name of the connection.
            "topic": "xxxx", // The minimum unit for data subscription and publication in DataHub. You can use topics to distinguish different types of streaming data.
            "mode": "random", // Data is written to a random shard.
            "shardId": "0", // Shards are concurrent channels that are used for data transmission in a topic. Each shard has a unique ID.
            "maxCommitSize": 524288, // The amount of data, in bytes, that DataHub Writer buffers before it sends the data to the destination. Buffering improves writing efficiency. Default value: 1048576, namely 1 MB.
            "maxRetryCount": 500 // The maximum number of retries after a failure.
          }
        }
      }
    }
  9. After the configuration is complete, click the Save and Run icons.
    Note
    • You can import data to DataHub only in the code editor.
    • If you want to change the template, click the Apply Template icon in the top toolbar. The original content is overwritten after you apply the new template.
    • After you save the sync node, click the Run icon. The node is run immediately.

      You can also click the Submit icon to commit the sync node to the scheduling system. The scheduling system runs the node as scheduled, starting from the next day, based on your configurations.
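
The template above is annotated with //-style comments, which the code editor tolerates but a strict JSON parser rejects. The following is a minimal sketch, not part of the official tooling, that strips such comments and checks that an edited job configuration still has the sections this topic relies on. The file name job.json is a placeholder for wherever you keep a local copy of the configuration.

    import json

    def strip_line_comments(text: str) -> str:
        """Drop //-style comments, leaving '//' inside quoted strings alone."""
        cleaned = []
        for line in text.splitlines():
            in_string = False
            kept = []
            i = 0
            while i < len(line):
                ch = line[i]
                if ch == '"' and (i == 0 or line[i - 1] != "\\"):
                    in_string = not in_string
                if not in_string and line.startswith("//", i):
                    break  # the rest of the line is a comment
                kept.append(ch)
                i += 1
            cleaned.append("".join(kept))
        return "\n".join(cleaned)

    def validate_job(text: str) -> None:
        """Parse the configuration and assert the structure used in this topic."""
        job = json.loads(strip_line_comments(text))
        assert job.get("type") == "job", "top-level 'type' must be 'job'"
        conf = job["configuration"]
        for section in ("setting", "reader", "writer"):
            assert section in conf, f"missing '{section}' section"
        assert conf["reader"]["plugin"] == "stream", "reader plugin must be 'stream'"
        assert conf["writer"]["plugin"] == "datahub", "writer plugin must be 'datahub'"
        print("The configuration parses and contains the expected sections.")

    if __name__ == "__main__":
        with open("job.json") as f:  # placeholder path to your edited template
            validate_job(f.read())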
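
After the node runs, you can confirm that records actually arrived in the DataHub topic. The following sketch uses the open-source pydatahub SDK (pip install pydatahub). The endpoint, credentials, project name, and topic name are placeholders, and the method calls follow the SDK's published examples; verify them against the SDK version that you install.

    from datahub import DataHub
    from datahub.models import CursorType

    # Placeholder values. Replace them with your own endpoint, credentials,
    # DataHub project, and the topic that is configured in the writer above.
    ENDPOINT = "https://dh-cn-hangzhou.aliyuncs.com"
    ACCESS_ID = "<your-access-key-id>"
    ACCESS_KEY = "<your-access-key-secret>"
    PROJECT = "my_project"
    TOPIC = "xxxx"
    SHARD_ID = "0"  # the shard that the sample job writes to

    dh = DataHub(ACCESS_ID, ACCESS_KEY, ENDPOINT)
    topic = dh.get_topic(PROJECT, TOPIC)

    # Read a few records from the oldest cursor to confirm that data was written.
    cursor = dh.get_cursor(PROJECT, TOPIC, SHARD_ID, CursorType.OLDEST)
    result = dh.get_tuple_records(
        PROJECT, TOPIC, SHARD_ID, topic.record_schema, cursor.cursor, 10)
    for record in result.records:
        print(record.values)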