
Import data in batches using data integration of DataHub

Last Updated: May 17, 2019

Data Integration is a reliable, secure, cost-effective, and elastically scalable data synchronization platform provided by Alibaba Group. It works across heterogeneous data storage systems and provides offline (full and incremental) data access channels in different network environments for more than 20 data sources. For more information about data source types, see Supported data sources.

This article explains how to import data offline into DataHub by using Data Integration.


  1. Activate an Alibaba Cloud primary account, and create an AccessKey for this account.

  2. Activate MaxCompute, which automatically generates a default MaxCompute data source, and log on to DataWorks using the primary account.

  3. Create a project. You must create a project before using DataWorks. Within a project, you can complete the entire workflow and collaborate with other members to maintain data and tasks.

    Note: If you want to create data integration tasks by using a sub-account, grant related permissions to the sub-account. For more information, see Prepare a RAM account and Project member management.


The following example synchronizes Stream data to DataHub in script mode:

  1. Log on to the DataWorks console as a developer and click Enter Project.


  2. Click Data Integration in the top menu to go to the Sync Tasks page.

  3. Select New > Script Mode on the page.


  4. Select a Source Type and a Target Type in the Import Template dialog box that appears.


  5. Click Confirm to enter the script mode configuration page and configure the task as needed. If you have any questions, click Help Manual in the upper-right corner.

    {
      "configuration": {
        "setting": {
          "speed": {
            "concurrent": "1", //Number of concurrent tasks
            "mbps": "1" //Maximum job rate in MB/s
          },
          "errorLimit": {
            "record": "0" //Maximum number of error records allowed
          }
        },
        "reader": {
          "parameter": {
            "column": [ //Columns of the source
              {
                "value": "1",
                "type": "string" //Column type
              },
              {
                "value": "1",
                "type": "string"
              }
            ],
            "sliceRecordCount": "1"
          },
          "plugin": "stream"
        },
        "writer": {
          "parameter": {
            "topic": "xxxxx", //A Topic is the smallest unit of DataHub subscription and publishing. A Topic represents one class or kind of streaming data.
            "project": "xxxx", //A Project is the basic organizational unit of DataHub data and contains multiple Topics.
            "accessKey": "xxxxxxxx", //DataHub AccessKey
            "shardId": "0", //A Shard is a concurrent data transmission channel within a Topic; each Shard has a corresponding ID.
            "maxRetryCount": 500, //Maximum number of retry attempts
            "maxCommitSize": 524288, //To improve write efficiency, data is submitted to the target in batches once the collected data reaches maxCommitSize bytes. The default is 1048576 (1 MB).
            "accessId": "xxxxxxx", //DataHub AccessId
            "endpoint": "", //For requests to access DataHub resources, select the correct domain name based on the service to which the resource belongs.
            "mode": "random" //Write records to Shards randomly.
          },
          "plugin": "datahub"
        }
      },
      "type": "job",
      "version": "1.0"
    }
  6. Click Save.


    • DataHub supports importing data only in script mode.

    • If you want to choose a new template, click Import Template in the toolbar. Note that the existing content is overwritten once the new template is imported. Proceed with caution.

  7. Click Run to run the synchronization task, and check the data quality of the target.


    After you save a synchronization task, you can click Run to run the task immediately, or click Submit to submit it to the scheduling system. The scheduling system then runs the task automatically and periodically starting from the next day, according to the configured scheduling properties. For more information, see Scheduling configuration.
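For reference, the script-mode configuration shown in step 5 can also be assembled programmatically before you paste it into the configuration page. The sketch below is illustrative only: the function name and its arguments are placeholders invented for this example, and the field names simply mirror the JSON above. It builds the same job definition as a Python dictionary and serializes it to JSON.

```python
import json


def build_stream_to_datahub_job(project, topic, access_id, access_key,
                                endpoint, shard_id="0"):
    """Build a script-mode sync job that writes Stream data to DataHub.

    Illustrative helper (not part of DataWorks); all argument values are
    placeholders that you replace with your own DataHub settings.
    """
    return {
        "configuration": {
            "setting": {
                "speed": {"concurrent": "1", "mbps": "1"},  # 1 task, 1 MB/s cap
                "errorLimit": {"record": "0"},  # no error records allowed
            },
            "reader": {
                "parameter": {
                    "column": [
                        {"value": "1", "type": "string"},
                        {"value": "1", "type": "string"},
                    ],
                    "sliceRecordCount": "1",
                },
                "plugin": "stream",
            },
            "writer": {
                "parameter": {
                    "topic": topic,
                    "project": project,
                    "accessKey": access_key,
                    "shardId": shard_id,
                    "maxRetryCount": 500,
                    "maxCommitSize": 524288,  # batch flush threshold in bytes
                    "accessId": access_id,
                    "endpoint": endpoint,  # fill in your region's DataHub endpoint
                    "mode": "random",  # write to Shards randomly
                },
                "plugin": "datahub",
            },
        },
        "type": "job",
        "version": "1.0",
    }


# Placeholder values, matching the "xxxx" fields in the template above.
job = build_stream_to_datahub_job(
    project="my_project", topic="my_topic",
    access_id="xxxxxxx", access_key="xxxxxxxx", endpoint="",
)
print(json.dumps(job, indent=2))
```

Generating the configuration this way makes it easy to keep credentials out of the template and to reuse the same reader/writer skeleton across tasks.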