This topic describes how to use Data Integration to import data to DataHub in offline mode. In this topic, a sync node is configured in the code editor to synchronize data from a Stream reader to DataHub.

Prerequisites

  1. An Alibaba Cloud account and its AccessKey pair are created. For more information, see Prepare an Alibaba Cloud account.
  2. MaxCompute is activated so that a default MaxCompute connection is automatically created, and the Alibaba Cloud account is used to log on to the DataWorks console.
  3. A workspace is created so that you can create a workflow in the workspace and create different types of nodes in the workflow to maintain data and perform data analytics. For more information, see Create a workspace.
    Note If you want to create a data integration node as a Resource Access Management (RAM) user, grant the required permissions to the RAM user. For more information, see Prepare a RAM user and Manage workspace members.

Background information

Data Integration is a reliable, secure, cost-effective, and scalable data synchronization platform provided by Alibaba Cloud. It supports synchronization across heterogeneous data storage systems and provides offline data access channels in diverse network environments for more than 20 types of connections.

Procedure

  1. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the target workspace and click Data Integration in the Actions column.
  2. On the homepage that appears, click New Task to go to the DataStudio page.
  3. In the Create Node dialog box that appears, set Node Name and Location and click Commit.
    Note
    • The node name can be up to 128 characters in length.
    • Set Location to the directory where the created workflow resides. For more information about how to create a workflow, see Create a workflow.
  4. After creating the batch sync node, click the Switch to Code Editor icon in the toolbar on the node editing tab.
  5. In the Confirm dialog box that appears, click OK to switch to the code editor.
  6. Click the Apply Template icon in the toolbar.
  7. In the Apply Template dialog box that appears, set Source Connection Type to Stream and Target Connection Type to DataHub, select the destination connection, and then click OK to apply a template.
  8. Edit the code as required in the code editor. A variant of the speed setting with throttling enabled is sketched after the template.
    {
    "type": "job",
    "version": "1.0",
    "configuration": {
     "setting": {
       "errorLimit": {
         "record": "0"
       },
       "speed": {
         "mbps": "1",
         "concurrent": 1,// The maximum number of concurrent threads.
         "throttle": false
       }
     },
     "reader": {
       "plugin": "stream",
       "parameter": {
         "column": [// The columns in the source table.
           {
             "value": "field",// The column attribute.
             "type": "string"
           },
           {
             "value": true,
             "type": "bool"
           },
           {
             "value": "byte string",
             "type": "bytes"
           }
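            // More column objects can be appended here to generate additional test fields,
            // for example {"value": "1000", "type": "long"}, assuming the reader also accepts the long type.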
         ],
         "sliceRecordCount": "100000"
       }
     },
     "writer": {
       "plugin": "datahub",
       "parameter": {
         "datasource": "datahub",// The name of the destination connection.
         "topic": "xxxx",// The minimum unit for data subscription and publication. You can use topics to distinguish different types of streaming data.
         "mode": "random",// The mode in which data is written. The value random indicates that data is randomly written.
         "shardId": "0",// The shard ID. Shards are concurrent channels used for data transmission in a topic. Each shard has a unique ID.
         "maxCommitSize": 524288,// The amount of data, in MB, that DataHub Writer buffers before sending it to the destination for the purpose of improving writing efficiency. The default value is 1048576, in KB, that is, 1 MB.
         "maxRetryCount": 500
       }
     }
    }
    }
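    If you need to cap the bandwidth that the node consumes instead of running it at full speed, you can adjust the speed setting in the same template. The following sketch uses illustrative values and assumes that mbps acts as the rate limit when throttle is set to true:
     "speed": {
       "mbps": "5",// Illustrative rate cap; the template above sets this to 1.
       "concurrent": 2,// Illustrative number of concurrent threads.
       "throttle": true// Enables bandwidth throttling.
     }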
  9. After the configuration is complete, click the Save and Run icons in sequence.
    Note
    • You can import data to DataHub only in the code editor.
    • If you want to change the template, click the Apply Template icon in the toolbar again. Once you apply the new template, it overwrites the original content.
    • After saving the batch sync node, click the Run icon. The node runs immediately.

      You can also click the Submit icon to submit the sync node to the scheduling system. The scheduling system then automatically runs the node based on the scheduling parameters, starting from the next day.