After you configure data sources, network environments, and resource groups, you can create and run sync nodes. This topic describes how to configure a sync solution to synchronize full data to MaxCompute on a regular basis and view the status of the nodes generated by the sync solution.
The One-click batch synchronization to MaxCompute (Cyclical Full) solution applies when you need to synchronize full data from specific tables to MaxCompute on a regular basis. It is suited to periodic data synchronization from a large number of tables. You can synchronize source tables in batches to reduce the load, and the flexible scheduling options let you choose when each periodic synchronization runs.
Configure a sync solution
- Go to the Create Data Synchronization Solution wizard. Select the source and the destination for data synchronization from the drop-down lists. In this scenario, select MaxCompute as the destination. Then, select One-click batch synchronization to MaxCompute (Cyclical Full) from the available sync solutions. For more information, see Select a data synchronization solution.
- Configure the network connection for data synchronization. Select the source data source, the exclusive resource group for Data Integration, and the destination data source as prompted, and then test the network connectivity. After that, click Next Step. Before you run the test, you must prepare the exclusive resource groups and the network connection solution that you want to use. You must also create connections to the data sources in DataWorks and configure network connectivity as required, for example, by configuring a whitelist. This helps avoid failures in connectivity tests. For more information, see Plan and configure resources.
- Configure the source and rules for data synchronization.
- Configure the basic information. In the Basic Configuration section, set the parameters that are described in the following table.
| Parameter | Description |
| --- | --- |
| Solution Name | The name of the sync solution. The name can be a maximum of 50 characters in length. |
| Description | The description of the sync solution. The description can be a maximum of 50 characters in length. |
| Location | If you select Automatic Workflow Creation, DataWorks automatically creates a workflow named in the format of clone_database_Source name+to+Destination name. All sync nodes generated by the sync solution are placed in the Data Integration folder of this workflow. If you clear Automatic Workflow Creation, you must select a directory from the Select Location drop-down list. All sync nodes generated by the sync solution are placed in the specified directory. |
- Check the data source information. The information about the data source selected in the preceding step is displayed in the Data Source section, and the encoding type is specified by default. You must check the information and determine whether to change the encoding type.
- Select the source tables for synchronization. Select the tables whose data you want to synchronize to MaxCompute as prompted. After you select the source tables, data is synchronized from the selected tables to MaxCompute based on the configuration of this sync solution.
Notice: If a selected table does not have a primary key, the table data cannot be synchronized in real time.
- Configure mapping rules for the names of the source and destination tables. Click Add rule, select a rule type, and then configure the mapping rules. Supported rule types are Conversion Rule for Table Name and Rule for Destination Table name.
- Conversion Rule for Table Name: the rule used to convert the names of source tables to those of destination tables.
- Rule for Destination Table name: the rule used to add a prefix or a suffix to the converted names of destination tables.
- Click Next Step.
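The two mapping rule types described above can be pictured as a two-stage transformation: first convert the source table name, then add a prefix or suffix to the converted name. The following is a minimal sketch of that idea only. The function `map_table_name`, the use of a regular expression for the conversion rule, and the sample names are all hypothetical; the actual rule syntax accepted by the wizard is not described in this topic.

```python
import re


def map_table_name(source_name, pattern, replacement, prefix="", suffix=""):
    """Illustrative two-stage mapping of a source table name.

    Stage 1 mirrors "Conversion Rule for Table Name": rewrite the source name.
    Stage 2 mirrors "Rule for Destination Table name": add a prefix or suffix
    to the converted name. Regex-based conversion is an assumption made here.
    """
    converted = re.sub(pattern, replacement, source_name)
    return f"{prefix}{converted}{suffix}"


# Hypothetical example: drop a trailing numeric shard suffix, then add
# a prefix and a suffix to the converted name.
print(map_table_name("order_2023", r"_\d+$", "", prefix="ods_", suffix="_full"))
```

Applying the conversion before the prefix or suffix matches the order in which the two rule types are described: the second rule operates on the already converted name.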
- Configure the destination.
- The destination data source selected in the preceding step is displayed on the page. Check whether the displayed information is valid.
- Click the icon next to Time automatic partition setting. In the Edit dialog box, modify the partition settings for the destination tables. You can configure daily partitions.
- Click Refresh source table and MaxCompute Table mapping to create the mappings between the source tables and destination MaxCompute tables.
- View the mapping progress, source tables, and mapped destination tables.
| No. | Description |
| --- | --- |
| 1 | The progress of mapping the source tables to the destination tables. Note: The mapping may require a long period of time if you want to synchronize data from a large number of tables. |
| 2 | The source of the destination table. Valid values: Create Table and Use Existing Table. |
| 3 | The name of the destination table. The table name that appears varies based on the value that you selected from the drop-down list in the Table creation method column. If you set the Table creation method parameter to Create Table, the name of the destination table that is automatically created appears. You can click the table name to view and modify the table creation statements. If you set the Table creation method parameter to Use Existing Table, you must select a table name from the drop-down list in the MaxComputeTable name column. |
| 4 | If a source table does not have a primary key, an error message appears to remind you that the current source table does not have a primary key and cannot be synchronized. The synchronization can be performed if one of the selected source tables has a primary key. Source tables without primary keys are ignored during the synchronization. |
- Click Next Step.
- Configure synchronization rules.
- Configure rules for full data synchronization.
| Parameter | Description |
| --- | --- |
| Clear the corresponding original table before writing | Enable this feature as needed. If you enable this feature, the previously synchronized tables are deleted from MaxCompute each time before data is synchronized. We recommend that you enable this feature with caution. |
| Synchronous concurrency configuration | Specifies whether to synchronize source tables in batches or all at once. We recommend that you select Batch Upload to synchronize a large number of source tables in batches. This prevents a heavy load from affecting data synchronization. |
| Interval for batch upload | If you set the Synchronous concurrency configuration parameter to Batch Upload, you must specify the number of tables to be synchronized after each interval of time. The interval must be at least 15 minutes and can be as long as several hours. |

For example, you set the scheduling time to 05:00 every day, and 300 source tables are to be synchronized. Based on this recurrence, data synchronization can last for a maximum of 19 hours, from 05:00 to the end of the same day. To prevent a heavy load, you can divide the 300 source tables into six batches and set the interval to three hours. This way, data synchronization starts at 05:00, and a batch of 50 tables is synchronized every three hours.
Note: You need to set the interval for batch upload based on the recurrence of data synchronization. The sum of the intervals for batch upload must be less than the available duration for data synchronization. In the preceding example, six batches of source tables are synchronized to MaxCompute every three hours. Therefore, the sum of the intervals for batch upload is 18 hours, which is less than 19 hours, the available duration for data synchronization.
- Configure the recurrence for data synchronization. Set the parameters to configure the recurrence of data synchronization as needed, such as the Recurrence, Run At, and Scheduling Period parameters. The configuration of the scheduling parameters is similar to that for a regular node. For more information, see Configure time properties.
- Click Next Step.
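The arithmetic behind the batch upload example in this step (300 tables, six batches, a three-hour interval, scheduling time 05:00) can be checked with a short calculation. The following is a minimal sketch, not part of DataWorks. The function `plan_batches` and its parameters are hypothetical, and the sketch follows the convention of the example by counting one interval per batch when it sums the intervals.

```python
from datetime import datetime, timedelta
from math import ceil


def plan_batches(table_count, batch_count, interval_hours, start="05:00"):
    """Return a list of (start time, number of tables) pairs, one per batch.

    Raises ValueError if the sum of the intervals is not less than the time
    available between the scheduling time and the end of the same day.
    """
    start_dt = datetime.strptime(start, "%H:%M")
    end_of_day = start_dt.replace(hour=0, minute=0) + timedelta(days=1)
    available_hours = (end_of_day - start_dt).total_seconds() / 3600

    # One interval per batch, as in the example (6 batches x 3 hours = 18 hours).
    total_hours = batch_count * interval_hours
    if total_hours >= available_hours:
        raise ValueError(
            f"intervals sum to {total_hours} h, but only {available_hours} h are available"
        )

    per_batch = ceil(table_count / batch_count)
    plan, remaining = [], table_count
    for i in range(batch_count):
        size = min(per_batch, remaining)
        batch_start = start_dt + timedelta(hours=i * interval_hours)
        plan.append((batch_start.strftime("%H:%M"), size))
        remaining -= size
    return plan


for batch_start, size in plan_batches(300, 6, 3):
    print(batch_start, size)
```

With the values from the example, the six batches start at 05:00, 08:00, 11:00, 14:00, 17:00, and 20:00, each with 50 tables. Seven batches at the same interval would sum to 21 hours, exceed the 19 hours available, and be rejected.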
- Configure the resources required by the sync solution. In the Set Resources for Solution Running step, check the name of the sync node to be generated by the sync solution, the resource group for Data Integration, and the resource group for scheduling. Then, set the Maximum number of connections supported by source read parameter.
Note: The Maximum number of connections supported by source read parameter specifies the maximum number of Java Database Connectivity (JDBC) connections that are allowed for the source. You must set this parameter based on the capabilities of the source. If the specified number of connections exceeds the capabilities of the source, data may fail to be read from the source.
- Click Complete Configuration. The sync solution is configured.
Run the sync solution
On the Tasks page, find the configured sync solution and click Submit and Run in the Operation column to run the sync solution.
View the running status and result of the data synchronization nodes
- On the Task list page, find the sync solution that you ran and choose More > Execution details in the Operation column. You can then view the running details of all nodes.
- Find a node whose running details you want to view and click Execution details in the Status column. In the dialog box that appears, click the provided link to go to the DataStudio page.