DataWorks: Perform O&M on a full and incremental synchronization task

Last Updated: Feb 23, 2024

After a synchronization task is configured, you can manage the task and view the running details of the task. This topic describes common O&M operations that can be performed on a full and incremental synchronization task.

Background information

This topic covers only the O&M operations that apply to a full and incremental synchronization task as a whole. For information about how to perform O&M operations on the real-time synchronization subtask and the batch synchronization subtasks that are generated by a full and incremental synchronization task, see O&M for real-time synchronization nodes and O&M for batch synchronization nodes.

Manage a full and incremental synchronization task

After a full and incremental synchronization task is configured, you can go to the Nodes page in Data Integration in the DataWorks console to view the synchronization task. This page displays all created synchronization tasks. You can specify filter conditions to search for the desired synchronization task. Then, you can perform the operations that are described in the following table on the synchronization task.

Operation

Description

Start

You can click Commit and Run in the Actions column of the synchronization task to start the synchronization task.

Edit

In the production environment, the set of business tables may grow or shrink over time, and you may need to adjust the tables from which your synchronization task reads data. Data Integration allows you to change the source tables that are specified in a synchronization task.

To do so, click More in the Actions column of the synchronization task and select Modify Configuration to go to the configuration page of the synchronization task. On the configuration page, add or remove source tables based on your business requirements. After the adjustment is complete, go back to the Nodes page, find the synchronization task, and then click Commit and Run in the Actions column to run the synchronization task again.

When you rerun the synchronization task, the system compares the source tables specified for this run with the source tables specified for the previous run. If new tables are detected, the system synchronizes data from the new tables. For more information, see Add or remove source tables to or from a synchronization solution that is running.

If you run a one-click real-time synchronization task, the synchronization task synchronizes full data from the newly added tables. After the full data is synchronized, the system runs the real-time synchronization subtask generated by the synchronization task to synchronize incremental data from the newly added tables in real time.
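The table comparison that happens on rerun can be pictured as a set difference between the two table lists. The following is a minimal Python sketch for illustration, not DataWorks code; the function name `diff_source_tables` and the table names are hypothetical.

```python
def diff_source_tables(previous_run, current_run):
    """Return (added, removed) table names between two runs of a task."""
    previous, current = set(previous_run), set(current_run)
    added = sorted(current - previous)     # tables that need a full sync first
    removed = sorted(previous - current)   # tables that are no longer synchronized
    return added, removed


# Hypothetical table lists, for illustration only.
added, removed = diff_source_tables(
    ["orders", "customers", "payments"],
    ["orders", "customers", "refunds"],
)
print(added)    # ['refunds']
print(removed)  # ['payments']
```

Only the tables in `added` would go through full synchronization followed by incremental synchronization; tables present in both runs are left as they are.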

Note
  • After you add source tables to the synchronization task and start the synchronization task, the system first synchronizes full data from the added source tables. It then synchronizes incremental data from the added source tables, starting from the point in time at which the full synchronization began. For example, your synchronization task starts to run at 08:00 and is still running at 09:00. You add a source table to the synchronization task at 09:00. The system starts to synchronize full data from the table at 09:00, and the full synchronization ends at 10:00. Then, the system stops the real-time synchronization subtask that is running and starts to synchronize the incremental data that was generated in the table from 09:00 to 10:00 to the destination table. If you add source tables to a synchronization task after the synchronization task is run, the system guarantees only that the data is consistent before and after the synchronization. Data may be inconsistent while the synchronization is in progress.

  • If you want to synchronize full data from all source tables specified in a synchronization task, you must forcefully rerun the synchronization task.
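The timeline in the first note above can be worked through with a few lines of Python. This is only an illustrative sketch: the function `plan_added_table` is hypothetical, and it assumes that full synchronization starts at the moment the table is added.

```python
from datetime import datetime, timedelta


def plan_added_table(added_at, full_sync_duration):
    """Return (phase, window start, window end) for a newly added table.

    Assumption: full synchronization starts when the table is added.
    """
    full_end = added_at + full_sync_duration
    return [
        # Phase 1: copy the rows that already exist in the new table.
        ("full", added_at, full_end),
        # Phase 2: replay the changes that were made while phase 1 ran.
        ("incremental catch-up", added_at, full_end),
    ]


phases = plan_added_table(datetime(2024, 1, 1, 9, 0), timedelta(hours=1))
for name, start, end in phases:
    print(f"{name}: {start:%H:%M} to {end:%H:%M}")
# full: 09:00 to 10:00
# incremental catch-up: 09:00 to 10:00
```

Both windows match because the incremental phase has to cover exactly the changes that arrived while the full copy was in progress.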

Forcefully rerun

In some special cases, you can click More in the Actions column of the synchronization task and select Force Rerun to rerun the synchronization task. For example, if data in the source is contaminated or errors occur on data links, you can perform the forcible rerun operation. After you forcefully rerun the synchronization task, the system synchronizes full data and incremental data from the source to the destination again.

Note
  • Only synchronization tasks that are used to synchronize data to Hologres and MaxCompute can be forcefully rerun.

  • A synchronization task that is used to synchronize data from tables in sharded databases cannot be forcefully rerun.

In the following scenarios, a one-click real-time synchronization task used to synchronize data to MaxCompute needs to be rerun to restore data:

  • The source of the synchronization task is a MySQL data source, and the real-time synchronization subtask generated by the synchronization task fails for a long period of time. As a result, the binary logs are deleted, and incremental data in the MySQL data source cannot be synchronized.

  • Destination tables do not contain fields that were newly added to source tables.

  • Data accuracy issues, such as data loss, occur on the data that is synchronized to destination tables.

Important
  • If you forcefully rerun a synchronization task, the synchronization task synchronizes data from the fields in source tables to the fields in destination tables again. If some fields in source tables do not exist in destination tables, the system automatically adds the missing fields to the destination tables to ensure field consistency.

  • Before you forcefully rerun a synchronization task, you must check whether the rerun will cause a conflict between the merge subtask instance that was generated before the forceful rerun and the instance that is generated during the forceful rerun. When you forcefully rerun the synchronization task, the merge subtask instance that was generated before the rerun may be running or scheduled to run. If the two instances have the same data timestamp and run at the same time, they may overwrite each other's data in the destination partitions or tables.

    You can go to the Cycle Instance page in Operation Center and view the status of the merge subtask instance that was generated before the forceful rerun. If the rerun will cause a conflict between the two instances, you can perform one of the following operations to resolve the issue:

    • If the instance of the merge subtask that is generated by the synchronization task before the forceful rerun is running, rerun the synchronization task after the instance finishes running.

    • If the instance of the merge subtask that is generated by the synchronization task before the forceful rerun has not started to run, freeze the instance. Unfreeze the instance after the rerun operation is complete.

  • If data is not generated or the automatic scheduling of the merge subtask is not resumed on the next day after you forcefully rerun a synchronization task, you must check whether the following issues exist and manually resume the scheduling of the instance of the merge subtask:

    • If latency occurs on the synchronization task, resolve the latency issue. For more information, see Solutions to latency on a real-time synchronization node.

    • If the instance of the merge subtask in the previous cycle has not run or failed, you can remove the dependency of the instance of the merge subtask in the current cycle on the instance in the previous cycle. For information about how to view the information of an auto triggered instance, see View auto triggered node instances.
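The conflict check on merge subtask instances described above can be summarized as a small decision function. The following Python sketch is illustrative only: the class `MergeInstance`, the function `action_before_rerun`, and the status labels are hypothetical and do not correspond to a DataWorks API. Treating a finished instance as safe is also an assumption.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MergeInstance:
    data_timestamp: date
    status: str  # hypothetical labels: "running", "not_started", "finished"


def action_before_rerun(existing, rerun_data_timestamp):
    """Suggest what to do about a merge instance created before the rerun."""
    if existing.data_timestamp != rerun_data_timestamp:
        return "proceed"  # different data timestamps do not collide
    if existing.status == "finished":
        return "proceed"  # assumption: nothing is left to run concurrently
    if existing.status == "running":
        return "wait until the instance finishes, then rerun"
    if existing.status == "not_started":
        return "freeze the instance, rerun, then unfreeze it"
    raise ValueError(f"unknown status: {existing.status!r}")


pending = MergeInstance(date(2024, 2, 22), "not_started")
print(action_before_rerun(pending, date(2024, 2, 22)))
# freeze the instance, rerun, then unfreeze it
```

The two non-trivial branches mirror the two resolutions listed above: wait for a running instance, or freeze an instance that has not started.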

Backfill full data

You can perform this operation if you need to synchronize full data from the source again to resolve data accuracy issues, such as data loss, that occur on the data synchronized to MaxCompute tables in the synchronization task.

Note
  • Only one-click real-time synchronization tasks used to synchronize data to MaxCompute support full data backfill.

  • Synchronization tasks that are used to synchronize data from tables in sharded databases do not support full data backfill.

To backfill full data for a one-click real-time synchronization task used to synchronize data to MaxCompute, find the synchronization task on the Nodes page in Data Integration, click More in the Actions column, and then select Backfill Full Data.

  1. Select the data timestamp of the data backfill instance.

    If destination MaxCompute tables are partitioned tables, the synchronization task synchronizes full data from the source to the date partitions that are specified by the data timestamp.

  2. Select source tables based on which you want to backfill full data.

    In the list on the left, select the tables from which you want to synchronize full data. Then, click the icon that moves the selected tables to the list on the right.

  3. Click OK.

Important
  • You can select only a single day as a data timestamp. If you want to backfill full data for multiple days, you must perform the full data backfill operation multiple times.

  • A one-click full synchronization task synchronizes data from the source fields whose names are the same as destination fields and the additional source fields defined in the synchronization task.

  • Before you backfill full data for a one-click real-time synchronization task used to synchronize data to MaxCompute, you must check the data timestamp of the data backfill instance and make sure that the data backfill instance does not conflict with the merge subtask instance that was generated before the full data backfill operation. When you backfill full data for the synchronization task, the merge subtask instance that was generated before the backfill may be running or scheduled to run. If the two instances have the same data timestamp and run at the same time, they may overwrite each other's data in the destination partitions or tables.

    You can go to the Cycle Instance page in Operation Center and view the status of the instance of the merge subtask. If the data backfill instance conflicts with the instance of the merge subtask, you can perform one of the following operations to resolve the issue:

    • If the instance of the merge subtask is running, backfill full data for the synchronization task after the instance finishes running.

    • If the instance of the merge subtask has not started to run, freeze the instance. Unfreeze the instance after the full data backfill operation is complete.
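Because each backfill operation accepts a single day as its data timestamp, covering a date range means repeating the operation once per day. The following Python sketch only enumerates the days to process; the function `backfill_days` is hypothetical, and the backfill itself is still performed in the console.

```python
from datetime import date, timedelta


def backfill_days(first_day, last_day):
    """List each data timestamp that needs its own backfill operation."""
    if last_day < first_day:
        raise ValueError("last_day must not be earlier than first_day")
    span = (last_day - first_day).days
    return [first_day + timedelta(days=offset) for offset in range(span + 1)]


for day in backfill_days(date(2024, 2, 20), date(2024, 2, 22)):
    print(f"run Backfill Full Data with data timestamp {day.isoformat()}")
# run Backfill Full Data with data timestamp 2024-02-20
# run Backfill Full Data with data timestamp 2024-02-21
# run Backfill Full Data with data timestamp 2024-02-22
```

Check each listed day against existing merge subtask instances before you start the corresponding backfill.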

Stop

To stop a running synchronization task, click Stop in the Actions column of the synchronization task.

View the status overview of synchronization tasks

You can go to the Running Status Overview page in Data Integration and specify a period of time to view the status overview of synchronization tasks. The Running Status Overview page contains the following sections:

  • Solution Status Distribution: displays the total number of synchronization tasks and displays the status distribution of the synchronization tasks in a pie chart. The statistical data about the status distribution shows the number of synchronization tasks that are successfully run and the number of synchronization tasks that fail to be run. The statistical data is collected in the specified period of time. You can click a sector in the pie chart to go to the synchronization task list page. On this page, you can view the synchronization tasks that are successfully run or fail to be run, and the running details of a synchronization task. For more information about the running details of a synchronization task, see View the running details of a synchronization task.

  • Usage of Resources in Resource Groups: displays the specifications and resource usage of the resource groups that are used within the current Alibaba Cloud account. You can click the name of a resource group to go to the details page of the resource group. On the details page, you can view the basic information and resource usage of the resource group. For information about resource groups, see View the resource usage of an exclusive resource group.

  • Batch Synchronization Nodes: displays the number of batch synchronization subtasks generated by specific synchronization tasks, the data synchronization speed, the status distribution of the batch synchronization subtasks, and the details of the synchronized data. The statistical data is collected in the specified period of time.

    • The statistical data about the status distribution shows the number of the batch synchronization subtasks that are successfully run and the number of the batch synchronization subtasks that fail to be run.

    • The Synchronization Data subsection displays the following items:

      • Number of synchronization subtasks: the number of batch synchronization subtasks that are successfully run

      • Amount of data synchronized: the amount of data synchronized by batch synchronization subtasks that are successfully run or running

      • Number of data records synchronized: the number of data records that are synchronized by batch synchronization subtasks

    Note

    The statistical data in the Batch Synchronization Nodes section is updated hourly.

  • Real-time Synchronization Nodes: displays the number of real-time synchronization subtasks generated by specific synchronization tasks, the data synchronization speed, the status distribution of the real-time synchronization subtasks, and the top 10 subtasks with the highest latency. You can click the name of a subtask to go to the Real Time DI page and view the details of the subtask.
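The status distribution shown in these sections amounts to counting runs by state within the selected period. The following Python sketch illustrates the idea with hypothetical run records; the function `status_distribution`, the status labels, and the choice to filter by finish time are assumptions, not a description of how the console computes its statistics.

```python
from collections import Counter
from datetime import datetime


def status_distribution(runs, start, end):
    """Count runs per status for runs that finished within [start, end)."""
    return Counter(
        status for finished_at, status in runs if start <= finished_at < end
    )


# Hypothetical run records: (finish time, status).
runs = [
    (datetime(2024, 2, 22, 8, 0), "succeeded"),
    (datetime(2024, 2, 22, 9, 30), "failed"),
    (datetime(2024, 2, 22, 11, 0), "succeeded"),
    (datetime(2024, 2, 23, 1, 0), "succeeded"),  # outside the window below
]
dist = status_distribution(runs, datetime(2024, 2, 22), datetime(2024, 2, 23))
print(dist["succeeded"], dist["failed"])  # 2 1
```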

View the running details of a synchronization task

You can click Data Synchronization Node in the left-side navigation pane of the Data Integration page to go to the Nodes page.

On the Nodes page, you can view information, such as the type and name, of a synchronization task and the operations that you can perform on the synchronization task. You can also click Running Details in the Actions column of a synchronization task to view the running details of the synchronization task. The Running Details page contains the following sections:

  • Process: displays information such as the status of environment preparation, batch synchronization subtasks, and the real-time synchronization subtask. You can check whether the subtasks are run as expected based on their status. This way, you can troubleshoot the issues that occur on the synchronization task at the earliest opportunity. The following icons are used to indicate different states:

    • If the Succeeded icon is displayed, the subtask is successfully run.

    • If the Exception icon is displayed, the subtask failed to be run.

    • If the Waiting icon is displayed, the subtask is waiting to be run.

  • Full Batch Synchronization and Real-time Synchronization: display the information about the batch synchronization subtasks and the real-time synchronization subtask generated by the synchronization task. The information includes the source name, data synchronization speed, synchronized data, resource group that is used, and data synchronization latency.

  • Steps: displays all steps that are required to complete the synchronization task from subtask creation to running of batch synchronization subtasks and the real-time synchronization subtask. You can view the start time, end time, and status of each step in this section.