Data Integration is a reliable, secure, cost-effective, and elastically scalable data synchronization platform provided by Alibaba Group. It synchronizes data across heterogeneous data storage systems and provides offline (full and incremental) data access channels in different network environments for more than 20 types of data sources. For more information about data source types, see Supported data sources.
This article explains how to import offline data into DataHub by using Data Integration.
Activate an Alibaba Cloud primary account and create an AccessKey for the account.
Activate MaxCompute, which automatically generates a default MaxCompute data source, and log on to DataWorks with the primary account.
Create a project so that you can complete workflows in it and collaboratively maintain data and tasks. You must create a project before you can use DataWorks.
The following example shows how to synchronize Stream data to DataHub in script mode:
Log on to the DataWorks console as a developer and click Enter Project.
Click Data Integration from the upper menu, and go to the Sync Tasks page.
Select New > Script Mode on the page.
In the Import Template dialog box that appears, select a Source Type and a Target Type, as shown in the following figure.
Click Confirm to go to the script mode configuration page and complete the configuration as needed. If you have any questions, click Help Manual in the upper-right corner.
"concurrent": "1", //Number of concurrent tasks
"mbps": "1" //Maximum job rate
"record": "0"// Number of error records
"column": [ //Column name of the source
"type": "string" //Column properties
"topic": "xxxxx", //Topic is the smallest unit of DataHub subscription and publishing. You can use Topic to represent a class or a kind of streaming data.
"project": "xxxx", //Project is the basic unit of DataHub data, which contains multiple topics.
"accessKey": "xxxxxxxx", //DataHub AccessKey.
"shardId": "0", //Shard represents a concurrent channel for data transmission of a Topic, and each Shard has a corresponding ID.
"maxRetryCount": 500, //Maximum retry attempts
"maxCommitSize": 524288. To improve the writing efficiency, submit data to the target end in batches when the collected data size reaches maxCommitSize (in MB). The maxCommitSize is 1048576 (1 MB) by default.
"accessId": "xxxxxxx",//DataHub accessId
"endpoint": "http://xxxxx.aliyun-inc.com", //For requests to access DataHub resources, select the correct domain name based on the service to which the resource belongs.
"mode": "random" //Random write.
DataHub only supports importing data in script mode.
If you want to use a different template, click Import Template in the toolbar. Note that the existing content is overwritten once the new template is imported, so proceed with caution.
Click Run to run the synchronization task, and check the data quality of the target.
After you save the synchronization task, click Run to run it immediately, or click Submit to submit it to the scheduling system. The scheduling system then runs the task automatically and cyclically from the next day according to the configured scheduling properties. For more information, see Scheduling configuration.