Use DataWorks Data Integration to run data synchronization jobs

You can use the Data Integration service provided by DataWorks to synchronize data from a data source to MaxCompute. Three synchronization methods are supported: batch synchronization, real-time synchronization, and integrated synchronization. This topic describes each method and what you must prepare before you use it.

Batch synchronization

In the Data Integration service provided by DataWorks, you define data sources or datasets as the source and destination of a synchronization task, and readers and writers move the data between them. Together, these components form a simple data synchronization framework that you can use to synchronize structured data and semi-structured data from a data source to MaxCompute.

For more information about how to configure a batch synchronization task, see Configure a batch synchronization node by using the codeless UI and Configure a batch synchronization node by using the code editor.
Usage notes
- Batch synchronization allows you to synchronize data from a single table in a database or from tables in sharded databases to a single MaxCompute table.
- Before you configure a synchronization task, you need to add a MaxCompute data source on the Data Sources page in the DataWorks console. For more information, see Add a MaxCompute data source.
- Before you configure a synchronization task, you need to make sure that a network connection is established between a resource group for Data Integration and your data source. For more information, see Establish a network connection between a resource group and a data source.
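
The reader-and-writer pairing described in this section can be pictured as a small job description. The following Python sketch builds such a description as a dictionary and prints it as JSON. It is only an illustration: the key names (`stepType`, `parameter`, `datasource`, `partition`, `truncate`, `setting`, `order`) are assumptions modeled on the general shape of a code editor script, and the data source names and table names are hypothetical. Check the code editor topic for the actual fields.

```python
import json


def build_batch_job(source_tables, dest_table, columns, partition, concurrent=2):
    """Build a batch synchronization job description as a plain dict.

    All key names below are illustrative assumptions, not a verified schema.
    """
    if not source_tables:
        raise ValueError("at least one source table is required")
    reader = {
        "name": "Reader",
        "category": "reader",
        "stepType": "mysql",  # assumed source type for this example
        "parameter": {
            "datasource": "my_mysql_source",  # hypothetical data source name
            "table": list(source_tables),     # one table, or several shards
            "column": list(columns),
        },
    }
    writer = {
        "name": "Writer",
        "category": "writer",
        "stepType": "odps",  # assumed identifier for a MaxCompute writer
        "parameter": {
            "datasource": "my_odps_source",   # hypothetical data source name
            "table": dest_table,              # always a single destination table
            "column": list(columns),
            "partition": partition,
            "truncate": True,                 # assumed overwrite switch
        },
    }
    return {
        "type": "job",
        "steps": [reader, writer],
        "setting": {
            "speed": {"concurrent": concurrent},
            "errorLimit": {"record": "0"},
        },
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


# Two shards of the same logical table are written to one MaxCompute table.
job = build_batch_job(
    source_tables=["orders_00", "orders_01"],
    dest_table="ods_orders",
    columns=["id", "amount", "gmt_create"],
    partition="ds=20240101",
)
print(json.dumps(job, indent=2))
```

The sketch keeps the destination as a single table name while the source accepts a list, which mirrors the rule that batch synchronization writes to one MaxCompute table per task.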

Real-time synchronization

The real-time data synchronization feature provided by DataWorks allows you to synchronize incremental data in one or more tables in source databases to MaxCompute in real time. This keeps the MaxCompute tables consistent with the source databases in real time. When you run a real-time synchronization task, you can use multiple conversion plug-ins to cleanse the source data and use multiple writers to write the cleansed data to your intended destinations at the same time. The following synchronization scenarios are supported: from a single table to a single MaxCompute table, from tables in sharded databases to a single MaxCompute table, and from multiple tables in a database to multiple MaxCompute tables.
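
The three table layouts mentioned in this section (one table to one table, sharded tables to one table, and multiple tables to multiple tables) differ only in how source table names map to destination table names. The following Python sketch shows that mapping in isolation. It does not call any DataWorks API. The `ods_` prefix and the `_<digits>` shard suffix are hypothetical naming rules chosen for the example.

```python
import re

# Hypothetical naming rule: destination tables get an "ods_" prefix.
DEST_PREFIX = "ods_"

# Assumed shard naming: a base name followed by "_<digits>", such as orders_00.
SHARD_SUFFIX = re.compile(r"^(?P<base>.+)_\d+$")


def dest_for(source_table, merge_shards=False):
    """Return the destination table name for one source table."""
    if merge_shards:
        match = SHARD_SUFFIX.match(source_table)
        if match:
            # Strip the shard suffix so that all shards share one destination.
            source_table = match.group("base")
    return DEST_PREFIX + source_table


def plan_routes(source_tables, merge_shards=False):
    """Map every source table to its destination table."""
    return {table: dest_for(table, merge_shards) for table in source_tables}


single = plan_routes(["orders"])                                      # one to one
sharded = plan_routes(["orders_00", "orders_01"], merge_shards=True)  # shards to one
multi = plan_routes(["orders", "customers"])                          # many to many

print(single)
print(sharded)
print(multi)
```

In the sharded case every source name resolves to the same destination, while in the multi-table case each source keeps its own destination.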

For more information about how to configure a real-time synchronization task, see Create a real-time synchronization node to synchronize incremental data from a single table and Configure a real-time synchronization node in DataStudio.
Usage notes
- Before you configure a synchronization task, you need to add a MaxCompute data source on the Data Sources page in the DataWorks console. For more information, see Add a MaxCompute data source.
- Purchase an exclusive resource group for Data Integration with appropriate specifications based on your requirements. For more information, see Create and use an exclusive resource group for Data Integration.
  Note
  No recommended concurrency value is provided for synchronization tasks to MaxCompute that run on exclusive resource groups. Configure the concurrency based on the amount of data to synchronize and the expected synchronization time. To reduce the synchronization time, you can purchase the resource specifications that support the maximum number of concurrent threads. For more information about the resource specifications that are required for a single task, see Performance metrics.
- Before you configure a synchronization task, you need to make sure that a network connection is established between a resource group for Data Integration and your data source. For more information, see Establish a network connection between a resource group and a data source.
- Before you run a real-time synchronization task, you need to configure the environment in which the MaxCompute data source runs. For more information, see Prepare a MaxCompute environment.
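
The note on concurrency in the preceding list says to size it from the data amount and the expected synchronization time. One way to turn that into a number is sketched below in Python. The per-thread throughput is an assumption that you would have to measure in a trial run, and the thread limit stands for whatever the purchased resource specifications allow. Neither value comes from this topic.

```python
import math


def estimate_concurrency(total_records, target_seconds,
                         records_per_thread_per_second, max_threads):
    """Estimate how many concurrent threads a task needs.

    records_per_thread_per_second is a measured, task-specific figure
    (an assumption here), and max_threads is the limit of the resource
    specifications that you purchased.
    """
    inputs = (total_records, target_seconds,
              records_per_thread_per_second, max_threads)
    if min(inputs) <= 0:
        raise ValueError("all inputs must be positive")
    # Records that one thread can move within the target time window.
    per_thread_capacity = target_seconds * records_per_thread_per_second
    needed = math.ceil(total_records / per_thread_capacity)
    # Never plan for fewer than one thread or more than the limit.
    return min(max(needed, 1), max_threads)


# 36 million records in one hour at an assumed 2,000 records/s per thread.
print(estimate_concurrency(36_000_000, 3600, 2000, max_threads=8))  # 5
# The same workload on specifications that allow only 4 threads.
print(estimate_concurrency(36_000_000, 3600, 2000, max_threads=4))  # 4
```

When the estimate is capped by the thread limit, as in the second call, the target time cannot be met with the current specifications, which is the case in which larger specifications are worth considering.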

Integrated synchronization

In practice, data synchronization is often complex: a single scenario can require multiple batch synchronization tasks, real-time synchronization tasks, and data processing tasks, and configuring all of them is complicated.

To resolve this issue, DataWorks provides configurable synchronization solutions that are tailored for specific business scenarios. The solutions allow you to synchronize data to MaxCompute with a few clicks. For more information, see Create a real-time synchronization solution to synchronize data to MaxCompute and Create a batch synchronization solution to synchronize all data in a database to MaxCompute.
Usage notes
- Before you configure a synchronization task, you need to add a MaxCompute data source on the Data Sources page in the DataWorks console. For more information, see Add a MaxCompute data source.
- Purchase an exclusive resource group for Data Integration with appropriate specifications based on your requirements. For more information, see Create and use an exclusive resource group for Data Integration.
  Note
  No recommended concurrency value is provided for synchronization tasks to MaxCompute that run on exclusive resource groups. Configure the concurrency based on the amount of data to synchronize and the expected synchronization time. To reduce the synchronization time, you can purchase the resource specifications that support the maximum number of concurrent threads. For more information about the resource specifications that are required for a single task, see Performance metrics.
- Before you configure a synchronization task, you need to make sure that a network connection is established between a resource group for Data Integration and your data source. For more information, see Establish a network connection between a resource group and a data source.
- Before you run a real-time synchronization task, you need to configure the environment in which the MaxCompute data source runs. For more information, see Prepare a MaxCompute environment.