DataWorks: Analyze run logs generated for a batch synchronization task

Last Updated: Dec 26, 2023

This topic describes how to view the run logs that are generated for a batch synchronization task.

Go to the log details page

You can view the run logs that are generated for a batch synchronization task in Operation Center or DataStudio.

Service

Description

Operation Center

You can go to the Cycle Instance, Test Instance, or Patch Data page in Operation Center, specify the filter conditions to search for the instance that is generated for the batch synchronization task, and then go to the log details page of the instance. For more information, see View auto triggered task instances, Backfill data for an auto triggered task and view data backfill instances generated for the task, or Test an auto triggered task and view test instances generated for the task.

DataStudio

In the Operation History pane of the DataStudio page, you can view the run logs that are generated for the batch synchronization task within the last three days. For more information, see View operating history.

View run logs generated for a batch synchronization task

The following figure shows the run logs generated for a batch synchronization task in different stages. You can click the link provided in the area marked with 1 or 5 in the following figure to view the detailed logs generated for the batch synchronization task in each stage.

Stage

Keyword

Description

Commit the task (area marked with 1)

SUBMIT: The batch synchronization task is issued by the scheduling system to the resource group for Data Integration for running. This indicates that the batch synchronization task is rendered.

The scheduling system issues the batch synchronization task to the resource group for Data Integration for running. You can view the resource group for Data Integration in the area marked with 1. The information that is printed in the run logs varies based on the type of the resource group that you use.

  • If the batch synchronization task is run on the shared resource group for Data Integration, the run logs contain the following information:

    running in Pipeline[basecommon_ group_xxxxxxxxx]

  • If the batch synchronization task is run on an exclusive resource group for Data Integration, the run logs contain the following information:

    running in Pipeline[basecommon_S_res_group_xxx]

  • If the batch synchronization task is run on a custom resource group for Data Integration, the run logs contain the following information:

    running in Pipeline[basecommon_xxxxxxxxx]
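
The three pipeline name patterns listed above can be told apart programmatically. The following is a minimal sketch, not part of DataWorks: the function name `classify_resource_group` is hypothetical, and the sketch assumes that the prefixes shown above are the only distinguishing feature. Whitespace inside the pipeline name is ignored so that `basecommon_ group_...` is treated as a shared resource group.

```python
import re
from typing import Optional

# Matches the pipeline name printed in the run log, for example
# "running in Pipeline[basecommon_S_res_group_xxx]".
PIPELINE_RE = re.compile(r"running in Pipeline\[(?P<name>[^\]]+)\]")


def classify_resource_group(log_line: str) -> Optional[str]:
    """Return 'exclusive', 'shared', or 'custom', or None if no pipeline is found."""
    match = PIPELINE_RE.search(log_line)
    if match is None:
        return None
    # Drop whitespace so that "basecommon_ group_..." and
    # "basecommon_group_..." are handled the same way (an assumption).
    name = re.sub(r"\s+", "", match.group("name"))
    # Check the most specific prefix first.
    if name.startswith("basecommon_S_res_group_"):
        return "exclusive"
    if name.startswith("basecommon_group_"):
        return "shared"
    if name.startswith("basecommon_"):
        return "custom"
    return None
```

Because the check relies only on prefixes, a custom resource group whose identifier happens to start with `group_` would be reported as shared. Treat the result as a hint and confirm the resource group in the area marked with 1.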

Note

You can click the link next to Detail log url in the area marked with 1 in the preceding figure to view the detailed logs generated for the batch synchronization task in each stage.

Wait for resources (area marked with 2)

WAIT: The batch synchronization task is waiting for resources in the resource group for Data Integration.

If the batch synchronization task waits for resources in the resource group for Data Integration for a long period of time, other tasks may be running on the resource group and the idle resources in the resource group are insufficient to run the current task. In this case, you can use one of the following solutions to resolve the issue:

  • Start the batch synchronization task after the tasks that are running on the resource group for Data Integration finish running. After the tasks finish running, the resources in the resource group for Data Integration are released. For information about how to find the tasks that occupy resources, see Scenarios and solutions to slow data synchronization.

  • Find the tasks that compete for resources with the batch synchronization task, contact the owners of the tasks, and then ask the owners to reduce the parallel threads for the tasks.

  • Reduce the parallel threads that you specified for the batch synchronization task. Then, commit and deploy the task again.

  • Scale out the resource group for Data Integration. For more information, see Scale out or in a resource group.

Run the task (area marked with 3)

RUN: The batch synchronization task is running.

A batch synchronization task runs in the following stages:

  1. Execute the configured SQL statement before data synchronization

    If you configure an SQL statement to execute before data synchronization, the system issues the statement to the related database and executes it on the database. If you do not configure such a statement, this stage is skipped.

    • If you configure the SQL statement that you want to execute before data synchronization for a batch synchronization task that uses MySQL Writer, the system issues the SQL statement to the related database and executes the SQL statement on the database in this stage.

    • If you configure a condition that is used for refined data filtering or a WHERE clause for a batch synchronization task that uses MySQL Reader, the system issues the related SQL statement to the related database and executes the SQL statement on the database in this stage.

    • If you configure a batch synchronization task that uses MaxCompute Writer to delete existing data before data is written, the system executes the related SQL statement to delete existing data from the destination table before new data is written to the table.

    Note

    We recommend that you use an indexed field for data filtering. Otherwise, the configured SQL statement may run on the related database for an extended period of time. As a result, the batch synchronization task may take a long time to run, or may fail because the execution of the SQL statement times out.

  2. Shard data in the source

    In this stage, data in the source is sharded and distributed to multiple shards. This way, the batch synchronization task can run parallel threads to read the data in batches. Data in the source is sharded based on the following rules:

    • If data is read from a relational database, the data is sharded based on the shard key that you specified and the batch synchronization task runs parallel threads to read the data in batches. If no shard key is specified, the batch synchronization task runs a single thread to read the data.

    • If data is read from a LogHub, DataHub, or MongoDB data source, the data is sharded based on the number of shards in the related data source. The maximum number of parallel threads that the batch synchronization task uses to read the data cannot exceed the number of shards.

    • If data is read from a semi-structured data source, the data is sharded based on the number of files or the data volume. For example, if data is read from an Object Storage Service (OSS) data source, the data is sharded based on the number of objects in the related OSS bucket. The maximum number of parallel threads that the batch synchronization task uses to read the data cannot exceed the number of the objects.

  3. Synchronize data

    In this stage, the batch synchronization task runs the specified number of parallel threads to read the sharded source data in batches. If data is read from a relational database, the system generates multiple SQL statements based on the shard key that you specified. The SQL statements are used to read the data from the database in batches.

    Note

    • During data synchronization, the number of parallel threads that are actually run by the batch synchronization task may not be the same as the number of parallel threads that you specified.

    • If you specify an inappropriate shard key, the SQL statements that are generated based on the shard key may run on the related database for an extended period of time. As a result, the batch synchronization task may take a long time to run, or may fail because the execution of the SQL statements times out.

    • If the loads on the related database are high, the batch synchronization task may require an extended period of time to run.

  4. Execute the configured SQL statement after data synchronization

    If you configure an SQL statement to execute after data synchronization, the system issues the statement to the related database and executes it on the database after the data is synchronized. If you do not configure such a statement, this stage is skipped.

    • If you configure the SQL statement that you want to execute after data synchronization for a batch synchronization task that uses MySQL Writer, the system issues the SQL statement to the related database and executes the SQL statement on the database after data synchronization in this stage.

    • The time that is required by the batch synchronization task to run is also affected by the execution time of the SQL statement that you want to execute after data synchronization.
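
The sharding rules described in stage 2 determine an upper bound on the number of parallel read threads. The following sketch only restates those rules in code. The function name `max_read_threads` and the source categories `"relational"`, `"sharded"`, and `"files"` are hypothetical labels chosen for illustration, and the number of threads that actually run may be lower than the value returned.

```python
from typing import Optional


def max_read_threads(source_kind: str, configured_threads: int,
                     shard_key: Optional[str] = None,
                     unit_count: Optional[int] = None) -> int:
    """Upper bound on reader threads under the sharding rules described above.

    source_kind: "relational", "sharded" (LogHub, DataHub, MongoDB),
                 or "files" (for example, objects in an OSS bucket).
    unit_count:  number of shards or objects in the source.
    """
    if configured_threads < 1:
        raise ValueError("configured_threads must be at least 1")
    if source_kind == "relational":
        # Without a shard key, the task reads with a single thread.
        return configured_threads if shard_key else 1
    if source_kind in ("sharded", "files"):
        if unit_count is None or unit_count < 1:
            raise ValueError("unit_count is required for this source kind")
        # Parallelism cannot exceed the number of shards or objects.
        return min(configured_threads, unit_count)
    raise ValueError(f"unknown source kind: {source_kind}")
```

For example, `max_read_threads("sharded", 8, unit_count=3)` returns 3, because a source with three shards cannot be read by more than three threads.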

Finish running (area marked with 4)

After the batch synchronization task finishes running, one of the following keywords is printed in the run logs:

  • FAIL: The batch synchronization task fails.

  • SUCCESS: The batch synchronization task is successfully run.

  • If the batch synchronization task fails, the key error message is printed in the run logs. You can click the link provided in the area marked with 5 to view the details of the batch synchronization task in each stage.

  • If the batch synchronization task is successfully run, the following information is printed in the run logs: the total number of synchronized data records, the total volume of synchronized data, and the average data synchronization speed.

Note

  • If dirty data is generated during data synchronization, Dirty data: xxR is printed in the run logs and the dirty data is not written to the destination.

  • If a large amount of dirty data is generated during data synchronization, the data synchronization speed is affected. If you have a high requirement for the data synchronization speed, we recommend that you handle the dirty data issue at the earliest opportunity after the dirty data is generated. For more information about dirty data, see Overview of the batch synchronization feature.

  • You can specify the maximum number of dirty data records that are allowed during data synchronization to control the impacts of dirty data on your batch synchronization task. By default, batch synchronization tasks allow the generation of dirty data. You can modify the settings related to dirty data on the configuration tab of your batch synchronization task. For information about how to configure a batch synchronization task by using the codeless user interface (UI), see Configure a batch synchronization task by using the codeless UI. For more information about how to configure a batch synchronization task by using the code editor, see Configure a batch synchronization task by using the code editor.
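
If you save the run logs as text, the dirty data count can be extracted and compared against the maximum that you allow. The following is a minimal sketch that assumes the `xx` in `Dirty data: xxR` is a decimal record count. The function names and the `max_dirty_records` parameter are hypothetical and are not DataWorks settings.

```python
import re

# "Dirty data: xxR" as described in the note above.
# Assumption: "xx" is a decimal number of dirty data records.
DIRTY_RE = re.compile(r"Dirty data:\s*(\d+)\s*R")


def dirty_record_count(log_text: str) -> int:
    """Return the largest dirty-record count reported in the log, or 0 if none."""
    counts = [int(m.group(1)) for m in DIRTY_RE.finditer(log_text)]
    return max(counts, default=0)


def exceeds_dirty_limit(log_text: str, max_dirty_records: int) -> bool:
    """True if the reported dirty-record count is above the allowed maximum."""
    return dirty_record_count(log_text) > max_dirty_records
```

The sketch takes the largest value because a log may report the count more than once as the task progresses. This is an assumption about the log layout, not documented behavior.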

View the detailed logs (area marked with 5)

Link that is provided in the area marked with 5.

You can click the link that is provided in the area marked with 5 to view the detailed logs generated for the batch synchronization task in each stage.
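
The stage keywords described above (SUBMIT, WAIT, RUN, FAIL, and SUCCESS) can be used to find out how far a task progressed from a saved copy of its run logs. The following is a rough sketch under the assumption that the keywords appear in uppercase as whole words. The function names are hypothetical, and the exact layout of the log lines is not specified in this topic, so verify the result against the actual logs.

```python
import re
from typing import List, Optional

STAGE_KEYWORDS = ("SUBMIT", "WAIT", "RUN", "FAIL", "SUCCESS")
# Case-sensitive whole-word match, so "running" does not count as RUN.
KEYWORD_RE = re.compile(r"\b(" + "|".join(STAGE_KEYWORDS) + r")\b")


def stages_seen(log_text: str) -> List[str]:
    """Return the stage keywords found in the log, in order of first appearance."""
    seen: List[str] = []
    for match in KEYWORD_RE.finditer(log_text):
        keyword = match.group(1)
        if keyword not in seen:
            seen.append(keyword)
    return seen


def latest_stage(log_text: str) -> Optional[str]:
    """Return FAIL or SUCCESS if the task finished, else the last stage seen."""
    seen = stages_seen(log_text)
    for terminal in ("FAIL", "SUCCESS"):
        if terminal in seen:
            return terminal
    return seen[-1] if seen else None
```

If `latest_stage` returns `WAIT` for a task that started long ago, the task is likely still waiting for resources, and the solutions listed for the wait stage apply.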

Configuration of a shard key for a relational database

  • We recommend that you set the shard key to the name of the primary key column of the source table. This way, data is evenly distributed across shards based on the primary key column instead of being concentrated in a few shards.

  • The shard key must be a column of an integer data type. If you specify a column of an unsupported data type as the shard key, the batch synchronization task ignores the shard key and uses a single thread to read data.

  • If no shard key is specified, the batch synchronization task uses a single thread to read data.
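
To illustrate why an evenly distributed integer column works well as a shard key, the following sketch splits a key range into contiguous sub-ranges and builds one SQL statement for each. This topic does not describe the actual splitting algorithm that Data Integration uses, so the equal-width strategy here is an assumption. The function name `split_queries` and the table and column names are hypothetical.

```python
from typing import List


def split_queries(table: str, shard_key: str, min_value: int, max_value: int,
                  shards: int) -> List[str]:
    """Build one SELECT per contiguous integer range of shard_key."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    total = max_value - min_value + 1
    shards = min(shards, total)          # never create empty ranges
    base, extra = divmod(total, shards)  # the first `extra` ranges get one more value
    queries = []
    start = min_value
    for i in range(shards):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        # Identifiers are interpolated directly; use trusted names only.
        queries.append(
            f"SELECT * FROM {table} WHERE {shard_key} >= {start} AND {shard_key} <= {end}"
        )
        start = end + 1
    return queries
```

For example, `split_queries("orders", "id", 1, 10, 3)` produces three statements that cover the ranges 1 to 4, 5 to 7, and 8 to 10. If the key values are heavily skewed, equal-width ranges hold very different numbers of rows, which is why a primary key column is the recommended choice.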