Data Integration, a feature of DataWorks, synchronizes data from external sources into MaxCompute. It supports batch synchronization for scheduled loads, real-time synchronization for continuous streaming, and local file uploads for ad hoc imports.
Before you begin
Before you begin, ensure that you have:
Created a MaxCompute project.
Created a table in the MaxCompute project to store the imported data.
Created a DataWorks workspace.
Attached a MaxCompute compute resource to the DataWorks workspace.
Compare import methods
Choose an import method based on your data source, latency requirements, and data volume.
| Method | When to use | Limitations |
|---|---|---|
| Local file upload -- Upload CSV, XLS, XLSX, or JSON files through the DataWorks console. | Ad hoc imports of small datasets from your local machine. | CSV files: 5 GB max. Other file types: 100 MB max. |
| OSS upload -- Import files from an OSS bucket through the DataWorks console. | Import files already stored in OSS. | The OSS bucket must be in the same region as the MaxCompute project. |
| Batch synchronization -- Create a synchronization node in DataStudio to read from a data source and write to MaxCompute on a schedule. | Periodic bulk imports from databases and other cloud services. Each node can load one or more source tables into a single MaxCompute table. | Runs on a schedule, not continuously. |
| Real-time synchronization -- Create a synchronization link that continuously streams data from a source to MaxCompute. | Low-latency, incremental synchronization for a single table or an entire database. | Requires resource groups for continuous operation. |
| Synchronization solutions -- Preconfigured solutions for entire-database batch sync and full-plus-incremental real-time sync. | Full database migrations, combined full and incremental loads. | Scenario-specific; review the solution documentation for supported sources. |
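The comparison above can be read as a small decision procedure. The following is a minimal sketch of that reading, not a DataWorks API: `ImportRequest`, its fields, and `choose_import_method` are hypothetical names introduced only for illustration, and the rule order is a simplification of the table.

```python
from dataclasses import dataclass


@dataclass
class ImportRequest:
    """Hypothetical description of an import job; not a DataWorks type."""
    source: str                             # "local", "oss", or "database"
    continuous: bool = False                # low-latency incremental sync needed
    full_database_migration: bool = False   # whole-database or full-plus-incremental load


def choose_import_method(req: ImportRequest) -> str:
    """Return the row of the comparison table that best fits the request."""
    if req.source == "local":
        return "Local file upload"
    if req.source == "oss":
        return "OSS upload"
    if req.full_database_migration:
        return "Synchronization solutions"
    if req.continuous:
        return "Real-time synchronization"
    # Default for scheduled, periodic bulk loads.
    return "Batch synchronization"


print(choose_import_method(ImportRequest(source="database", continuous=True)))
```

Real workloads may fit more than one row (for example, real-time synchronization can also cover an entire database), so treat the result as a starting point.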
Upload local files
Supported file formats: CSV, XLS, XLSX, and JSON. CSV files can be up to 5 GB; XLS, XLSX, and JSON files can be up to 100 MB each.
You can import from your local machine (Local File) or from an OSS bucket in the same region as the MaxCompute project (OSS).
Log on to the DataWorks console and select a region in the upper-left corner.
In the left-side navigation pane, choose Data Integration > Data Upload and Download.
Click the upload icon, then click Upload Data.
Select the import option (Local File or OSS), choose the file, and map it to the destination MaxCompute table.
For detailed instructions, see Upload data.
For earlier versions of DataWorks workspaces, upload local CSV or custom text files to a MaxCompute table by following Upload data (earlier versions).
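Checking a file against the format and size limits before you open the console can save a failed upload. The following is a minimal local pre-check, assuming the limits stated in this section; `check_upload` is a hypothetical helper, and treating 1 GB as 1024**3 bytes is an assumption, because this page does not say which unit convention applies.

```python
from pathlib import Path

GB = 1024 ** 3
MB = 1024 ** 2

# Limits taken from this page; binary units are an assumption.
SIZE_LIMITS = {
    ".csv": 5 * GB,
    ".xls": 100 * MB,
    ".xlsx": 100 * MB,
    ".json": 100 * MB,
}


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file type or size is outside the documented limits."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SIZE_LIMITS:
        raise ValueError(f"unsupported file format: {suffix or filename}")
    limit = SIZE_LIMITS[suffix]
    if size_bytes > limit:
        raise ValueError(f"{filename} is {size_bytes} bytes; the limit is {limit} bytes")
```

For a file on disk, pass its real size, for example `check_upload("orders.csv", Path("orders.csv").stat().st_size)`.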
Create a batch synchronization node
Batch synchronization uses Reader and Writer plug-ins to read data from sources and write data to destinations. Each batch synchronization node can import data from one or more tables into a single MaxCompute table.
Configure a batch synchronization node through the codeless UI or the code editor:
Log on to the DataWorks console and select a region in the upper-left corner.
In the left-side navigation pane, choose Data Development and O&M > Data Development.
In the Select Workspace section, click Go To Data Studio.
In the left pane of DataStudio, click the Create Node icon and choose Data Integration > Batch Synchronization.
Set the data source to the source system and the data destination to MaxCompute.
Configure field mappings, scheduling, and resource settings.
For configuration details, see Configure a node in the codeless UI or Configure a node in the code editor.
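To make the Reader, Writer, and field-mapping roles concrete, here is a conceptual sketch using in-memory data. It does not use or reproduce the Data Integration plug-ins or the node configuration format; `read_tables`, `map_fields`, and `run_batch` are hypothetical names, and the table and column names are invented for the example.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def read_tables(tables: Dict[str, List[Row]], names: Iterable[str]) -> Iterator[Row]:
    """Hypothetical Reader: yield rows from each named source table in turn."""
    for name in names:
        yield from tables[name]


def map_fields(row: Row, mapping: Dict[str, str]) -> Row:
    """Rename source columns to destination columns; unmapped columns are dropped."""
    return {dest: row[src] for src, dest in mapping.items()}


def run_batch(tables: Dict[str, List[Row]], names: Iterable[str],
              mapping: Dict[str, str]) -> List[Row]:
    """Hypothetical Writer: collect all mapped rows into one destination table."""
    return [map_fields(row, mapping) for row in read_tables(tables, names)]


# Two source tables loaded into a single destination, as one node allows.
sources = {
    "orders_2023": [{"id": 1, "amt": 10.0}],
    "orders_2024": [{"id": 2, "amt": 12.5}],
}
destination = run_batch(sources, ["orders_2023", "orders_2024"],
                        {"id": "order_id", "amt": "amount"})
print(destination)
```

The sketch shows only the data flow: several source tables, one mapping, one destination table. Scheduling and resource settings have no counterpart here.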
Create a real-time synchronization node
Real-time synchronization combines input and output data sources into a synchronization link. This link performs continuous incremental synchronization for a single table or an entire database.
Log on to the DataWorks console and select a region in the upper-left corner.
In the left-side navigation pane, choose Data Development and O&M > Data Development.
In the Select Workspace section, click Go To Data Studio.
In the left pane of DataStudio, click the Create Node icon and choose Data Integration > Real-time Synchronization.
Set the input to the source system and the output to MaxCompute.
For configuration details, see Configure a real-time synchronization task in DataStudio.
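Incremental synchronization means that only changes flow through the link, not full table copies. The sketch below illustrates that idea with an in-memory table keyed by primary key. It is a conceptual model only and makes no claim about how Data Integration writes changes to MaxCompute; the `Change` tuple and `apply_changes` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (operation, primary key, row data); a hypothetical change-record shape.
Change = Tuple[str, int, dict]


def apply_changes(target: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply a stream of change records to the target table, in order."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            target[key] = row          # upsert the latest version of the row
        elif op == "delete":
            target.pop(key, None)      # ignore deletes for rows never seen
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


table = {1: {"name": "a"}}
apply_changes(table, [
    ("insert", 2, {"name": "b"}),
    ("update", 1, {"name": "a2"}),
    ("delete", 2, {}),
])
print(table)
```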
Use synchronization solutions
Data Integration provides synchronization solutions for scenarios such as batch synchronization for an entire database and full-plus-incremental real-time synchronization. These solutions support various data sources.
Log on to the DataWorks console and select a region in the upper-left corner.
In the left-side navigation pane, choose Data Integration > Data Integration.
In the left-side navigation pane, select Sync Task and click Create Synchronization Task.
Configure the data source information in the Create Sync Task dialog box.
For details, see Configure a real-time synchronization task for an entire database.
Supported data synchronization features
The following table lists the Data Integration synchronization features available for MaxCompute, covering both reading from and writing to MaxCompute.
| Category | Feature | Supported |
|---|---|---|
| Batch synchronization | Read from single table | Yes |
| | Write to single table | Yes |
| | Read incremental data from single table | No |
| | Write incremental data to single table | Yes |
| Real-time synchronization | Read incremental data from entire database | No |
| | Write incremental data to entire database | Yes |
| Synchronization solutions | Read from entire database (batch) | No |
| | Write to entire database (batch) | Yes |
| | Read full and incremental data from single table or entire database (real-time) | No |
| | Write full and incremental data to single table or entire database (real-time) | Yes |
For the complete feature reference, see MaxCompute data source.
Billing
Data synchronization through Data Integration requires two types of resource groups:
Data Integration resource groups -- execute synchronization tasks.
Scheduling resource groups -- orchestrate and schedule tasks.
Both resource group types offer shared (multi-tenant) and exclusive (dedicated) options. Data transferred over the Internet may incur additional data transfer costs.
| Cost component | Billing model | Details |
|---|---|---|
| Exclusive resource groups for Data Integration | Subscription | Billing details |
| Shared resource groups for Data Integration (debugging) | Pay-as-you-go | Billing details |
| Exclusive resource groups for scheduling | Subscription | Billing details |
| Shared resource groups for scheduling | See documentation | Billing details |
| Internet data transfer | Usage-based | Internet traffic billing |