
DataWorks:Configure a Check node

Last Updated: Jul 08, 2025

DataWorks allows you to use a Check node to verify the availability of MaxCompute partitioned tables, File Transfer Protocol (FTP) files, Object Storage Service (OSS) objects, Hadoop Distributed File System (HDFS) files, OSS-HDFS objects, and real-time synchronization tasks based on check policies. If a task depends on one of these objects, you can configure the task as a descendant task of a Check node that verifies the availability of the object. When the condition specified in the check policy is met, the task on the Check node runs successfully and triggers its descendant task. This topic describes the supported check objects and check policies and explains how to configure a Check node.

Supported check objects and check policies

Check nodes can perform checks based only on data sources and real-time synchronization tasks. DataWorks supports the following check policies:

  • Data sources

    • MaxCompute partitioned tables

      Note

      You can use a Check node to verify the availability of a MaxCompute partitioned table, but not a non-partitioned table.

      The following check policies help you verify the data availability of a MaxCompute partitioned table:

      • Policy 1: Check whether a specified partition exists.

        If the partition exists, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

      • Policy 2: Check whether data in a specified partition is updated within a specified period of time.

        If the data in the partition is not updated within the specified period of time, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

    • FTP files, OSS objects, HDFS files, or OSS-HDFS objects

      If a specified FTP file, OSS object, HDFS file, or OSS-HDFS object exists, the system considers that the file or object is available.

  • Real-time synchronization tasks

    For this type of check object, the scheduled start time of the Check node is used as the reference point. If the real-time synchronization task has finished synchronizing all data that was generated at or before that point in time, the system considers that the task passes the check and is available.

In addition, you must specify the interval at which checks are triggered and a condition for stopping the check task on the Check node. The stop condition can be an end time for the check or a maximum number of checks. If the check is still not passed when the end time is reached or the maximum number of checks is exhausted, the Check node exits and enters the failed state. For more information, see Step 2: Configure a check policy for the Check node.
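The behavior that these settings produce can be summarized as a polling loop. The following Python sketch is illustrative only (run_check and check_fn are hypothetical names, not DataWorks APIs); it shows how the check interval, end time, maximum number of checks, and the 24-hour cap interact:

```python
# Minimal sketch of the stop-condition semantics described above. This is
# illustrative only: run_check and check_fn are hypothetical names, not
# DataWorks APIs.
import time
from datetime import datetime, timedelta

def run_check(check_fn, interval_minutes=5, max_checks=None, end_time=None):
    """Poll check_fn until it returns True or a stop condition is reached."""
    if not 1 <= interval_minutes <= 30:
        raise ValueError("The check interval must be 1 to 30 minutes.")
    # A check task runs for at most 24 hours (1,440 minutes).
    hard_deadline = datetime.now() + timedelta(minutes=1440)
    checks_done = 0
    while True:
        if check_fn():
            return "SUCCESS"  # Check condition met; descendant tasks can run.
        checks_done += 1
        now = datetime.now()
        if max_checks is not None and checks_done >= max_checks:
            return "FAILED"   # Maximum number of checks reached.
        if (end_time is not None and now >= end_time) or now >= hard_deadline:
            return "FAILED"   # End time (or the 24-hour cap) reached.
        time.sleep(interval_minutes * 60)

# For example, with a 5-minute interval the 24-hour cap allows at most
# 1440 // 5 = 288 checks, which matches the limits listed later in this topic.
```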

Note
  • A Check node periodically checks the target object. You must configure its scheduling time based on the expected check start time. After the scheduling conditions are met, the node remains in the running state until the check condition is met, in which case the node returns a success status, or until the stop condition is reached without the check being passed, in which case the node returns a failure status. For more information about scheduling configurations, see Step 3: Configure scheduling properties for the Check node.

  • A Check node occupies scheduling resources during the running process until the check is complete.

Limits

  • Limits on resource groups: You can use only Serverless resource groups to run tasks on Check nodes. For information about how to purchase and use a Serverless resource group, see Create and use a Serverless resource group.

  • Data source limits: FTP data sources that use the SFTP protocol with key-based authentication are not supported.

  • Limits on node features

    • A Check node can check only one object. If your task depends on multiple objects, such as multiple MaxCompute partitioned tables, you must create multiple Check nodes to separately check these objects.

    • The check interval of a Check node ranges from 1 to 30 minutes.

  • Limits on DataWorks editions: You can use Check nodes only in DataWorks Professional Edition or a more advanced edition. To upgrade an earlier edition, see Edition upgrade and downgrade.

  • Supported regions: You can use Check nodes in workspaces that reside in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).

Prerequisites

  • Before you use a Check node to perform a check based on a data source, you must first prepare the data source that you want to use. The following list describes the required preparations for each type of check object:

    • MaxCompute partitioned table

      1. A MaxCompute computing resource is created and attached to Data Development (DataStudio). When you create and attach a MaxCompute computing resource in DataWorks, a MaxCompute data source is automatically created.

      2. A MaxCompute partitioned table is created.

    • FTP file

      An FTP data source is added. You must add the FTP service to a DataWorks workspace as an FTP data source before you can use the data source to access data of the FTP service. For more information, see Create an FTP data source.

    • OSS object

      An OSS data source is added and the access mode of the data source is set to Access Key. You must add an OSS bucket to a DataWorks workspace as an OSS data source before you can use the data source to access data in the bucket.

      Note

      A Check node can access an OSS data source only in Access Key mode. You cannot use an OSS data source that is configured in RAM role authorization mode.

    • HDFS file

      An HDFS data source is added. You must add your HDFS service to a DataWorks workspace as an HDFS data source before you can use the data source to access its data. For more information, see Create an HDFS data source.

    • OSS-HDFS object

      An OSS-HDFS data source is added. You must add the OSS-HDFS service to a DataWorks workspace as an OSS-HDFS data source before you can use the data source to access data of the OSS-HDFS service. For more information, see OSS-HDFS data source.

  • If you use a Check node to check a real-time synchronization task, only tasks that synchronize data from Kafka to MaxCompute are supported. Before you use a Check node, you must create the required real-time synchronization task. For more information, see Configure a real-time synchronization task in DataStudio.

Step 1: Create a Check node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and O&M > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Click the Create icon and choose Create Node > General > Check Node.

    In the Create Node dialog box, configure the Path and Name parameters as prompted and click Confirm.

Step 2: Configure a check policy for the Check node

You can configure a check policy to perform a check based on a data source or a real-time synchronization task.

Data source

Configure a check policy for a MaxCompute partitioned table


The following table describes the parameters.

Data Source Type

Select MaxCompute.

Data Source Name

The name of the data source to which the MaxCompute partitioned table that you want to check belongs.

If no data source is available, you can click New Data Source to add a data source. For more information about how to add a MaxCompute data source, see Bind a MaxCompute computing resource.

Table Name

The name of the MaxCompute partitioned table that you want to check.

Note

You can select only a MaxCompute partitioned table that belongs to the specified data source.

Partition

The name of the partition in the MaxCompute partitioned table that you want to check.

After you configure the Table Name parameter, you can click Preview Table Information to obtain the partition name. You can also use scheduling parameters to obtain the partition name. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

Condition For Check Passing

Specifies the check method and check passing conditions of the partitioned table. Valid values:

  • Partition Existed: checks whether the specified partition exists.

    • If the partition exists, the partitioned table passes the check, and the system considers that the partitioned table is available.

    • If the partition does not exist, the partitioned table fails the check, and the system considers that the partitioned table is unavailable.

  • Last Modification Time Not Updated for Specific Duration: checks whether data in the specified partition is updated within a specified period of time. This check is based on the LastModifiedTime property of the partition.

    • If data in the partition is not updated within the specified period of time, the partitioned table passes the check, and the system considers that the data write operation is complete and the partitioned table is available.

    • If data in the partition is updated within the specified period of time, the partitioned table fails the check, and the system considers that the data write operation is not complete and the partitioned table is unavailable.

    Note
    • You can check whether partition data is updated within 5, 10, 15, 20, 25, or 30 minutes.

    • For more information about the LastModifiedTime parameter, see Change the value of LastModifiedTime.

Check Stopping Policy

The policy for stopping a check task on the current Check node. DataWorks allows you to specify the point in time at which a check is stopped or the maximum number of checks. You can also specify the check frequency:

  • Time for Stopping Check: You can specify the check interval and the end time for the check task. If the check is still not passed after the specified time elapses, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the time when a check task starts to run is later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: You can specify the check interval and the maximum number of checks. If the number of times that a check is performed exceeds the specified number and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes). The maximum number of times that a check can be performed varies based on the selected check interval. For example, if you set the check interval to 5 minutes, you can perform a check for a maximum of 288 times. If you set the check interval to 10 minutes, you can perform a check for a maximum of 144 times. You can view the parameter settings in the DataWorks console.
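To make the two check methods concrete, the following sketch reproduces them with PyODPS. It is illustrative only: the Check node performs these checks internally, the credentials and names are placeholders, and the last_data_modified_time attribute is an assumption to verify against your PyODPS version.

```python
# Illustrative sketch of the two check methods, using PyODPS. The Check node
# performs these checks for you; this only shows the underlying logic.
from datetime import datetime, timedelta
from odps import ODPS

# Placeholders: fill in your own credentials, project, and endpoint.
o = ODPS("<access_key_id>", "<access_key_secret>", project="my_project",
         endpoint="https://service.cn-hangzhou.maxcompute.aliyun.com/api")
t = o.get_table("my_partitioned_table")

# Partition Existed: the check passes if the specified partition exists.
passed = t.exist_partition("ds=20250708")

# Last Modification Time Not Updated for Specific Duration: the check passes
# if the partition exists and its data has not been modified for the given
# duration (5-30 minutes). The attribute name below is an assumption; verify
# it against your PyODPS version.
if t.exist_partition("ds=20250708"):
    part = t.get_partition("ds=20250708")
    quiet_for = timedelta(minutes=10)
    passed = datetime.now() - part.last_data_modified_time >= quiet_for
```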

Configure a check policy for an FTP file


The following table describes the parameters.

Data Source Type

Select FTP.

Data Source Name

The name of the data source to which the FTP file that you want to check belongs.

If no data source is available, you can click New Data Source to add a data source. For more information about how to add an FTP data source, see FTP data source.

File Path

The storage path of the FTP file that you want to check. Example: /var/ftp/test/.

If the specified path exists, the system considers that a file with the same name exists.

You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

Condition For Check Passing

Specifies the check passing condition of the FTP file.

  • If the FTP file exists, the check is passed, and the system considers that the FTP file is available.

  • If the FTP file does not exist, the check is not passed, and the system considers that the FTP file is unavailable.

Policy For Stopping Check

The policy for stopping the check task on the current Check node. The options and limits are the same as those of the Check Stopping Policy parameter described in Configure a check policy for a MaxCompute partitioned table: specify an end time for the check (Time for Stopping Check) or a maximum number of checks (Checks Allowed Before Check Node Stops), together with a check interval of 1 to 30 minutes.
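For reference, the existence check for an FTP file can be reproduced with Python's standard ftplib, as in the sketch below. This is not what the Check node runs internally; the host, credentials, and path are placeholders, and the fallback to a directory listing is an assumption about server behavior.

```python
# Sketch of an FTP existence check using the standard library. Placeholders
# throughout; the Check node reads the connection details from the data source.
from ftplib import FTP, error_perm

def ftp_path_exists(host, user, password, path):
    with FTP(host) as ftp:
        ftp.login(user, password)
        try:
            ftp.size(path)  # Succeeds only for existing files on most servers.
            return True
        except error_perm:
            # Fall back to listing the parent directory, which also covers
            # directories and servers that reject SIZE.
            parent, _, name = path.rstrip("/").rpartition("/")
            entries = ftp.nlst(parent or "/")
            return name in (entry.rsplit("/", 1)[-1] for entry in entries)

print(ftp_path_exists("ftp.example.com", "user", "password", "/var/ftp/test/data.csv"))
```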

Configure a check policy for an OSS object


The following table describes the parameters.

Data Source Type

Select OSS.

Data Source Name

The name of the data source to which the OSS object that you want to check belongs.

If no data source is available, you can click New Data Source to add a data source. For more information about how to add an OSS data source, see OSS data source.

File Path

The storage path of the OSS object that you want to check. You can perform the following operations to view the storage path of an OSS object: Log on to the Object Storage Service (OSS) console. Go to the details page of a desired bucket. In the left-side navigation pane of the details page, choose File Management > Files > OSS Objects.

The path must conform to the object path format of OSS:

  • If the path ends with a forward slash (/), the Check node checks whether a folder with the same name exists in OSS.

    Example: If the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a forward slash (/), the Check node checks whether an object with the same name exists in OSS.

    Example: If the path is user, the Check node checks whether an object named user exists.

Limits on checking folders: If you want to check whether a folder exists, note the following two scenarios (see the sketch at the end of this section):

  • Direct file upload (for example, put /a/b/1.txt): This operation does not create the /a or /a/b folders. It creates only the file object /a/b/1.txt. The console displays a virtual folder /a/b/, but the folder does not actually exist when you check for it.

  • Step-by-step path upload (for example, put /a, then put /a/b, then put /a/b/1.txt): This operation creates the folder objects /a and /a/b (0 KB in size) in addition to the file object, so the folder exists when you check for it.

Note

After you select a data source, the system automatically uses the bucket that is configured in the data source. You do not need to specify bucket information in the path. After you enter a path, you can click View Complete Path Information to view the endpoint and bucket information of the OSS data source in the development environment.

Condition For Check Passing

Specifies the check passing condition of the OSS object.

  • If the OSS object exists, the check is passed, and the system considers that the OSS object is available.

  • If the OSS object does not exist, the check is not passed, and the system considers that the OSS object is unavailable.

Policy For Stopping Check

The policy for stopping the check task on the current Check node. The options and limits are the same as those of the Check Stopping Policy parameter described in Configure a check policy for a MaxCompute partitioned table: specify an end time for the check (Time for Stopping Check) or a maximum number of checks (Checks Allowed Before Check Node Stops), together with a check interval of 1 to 30 minutes.
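The folder-versus-object behavior of the File Path parameter can be reproduced with the OSS Python SDK (oss2), as in the sketch below. It is illustrative only: the endpoint, bucket, and keys are placeholders, and how the Check node implements the check internally is not documented here.

```python
# Sketch of the object and folder checks using the oss2 SDK. Placeholders for
# credentials, endpoint, and bucket name.
import oss2

auth = oss2.Auth("<access_key_id>", "<access_key_secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-bucket")

# Path without a trailing slash: check for an object with the same name.
print(bucket.object_exists("a/b/1.txt"))   # True after put /a/b/1.txt

# Path with a trailing slash: check for a folder object. This returns False
# for a "virtual" folder that only appears in the console because a file such
# as a/b/1.txt was uploaded directly.
print(bucket.object_exists("a/b/"))

# A virtual folder can still be detected by listing objects with the prefix,
# but that is not the trailing-slash check described above.
print(any(oss2.ObjectIterator(bucket, prefix="a/b/", max_keys=1)))
```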

Configure a check policy for an HDFS file

The following table describes the parameters.

Data Source Type

Select HDFS.

Data Source Name

The name of the data source to which the HDFS file that you want to check belongs.

If no data source is available, you can click New Data Source to add a data source. For more information about how to add an HDFS data source, see HDFS data source.

File Path

The storage path of the HDFS file that you want to check. Example: /user/dw_test/dw.

If the specified path exists, the system considers that a file with the same name exists.

You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

Condition For Check Passing

Specifies the check passing condition of the HDFS file.

  • If the HDFS file exists, the check is passed, and the system considers that the HDFS file is available.

  • If the HDFS file does not exist, the check is not passed, and the system considers that the HDFS file is unavailable.

Policy For Stopping Check

The policy for stopping the check task on the current Check node. The options and limits are the same as those of the Check Stopping Policy parameter described in Configure a check policy for a MaxCompute partitioned table: specify an end time for the check (Time for Stopping Check) or a maximum number of checks (Checks Allowed Before Check Node Stops), together with a check interval of 1 to 30 minutes.
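For reference, the HDFS existence check is equivalent to the standard hadoop fs -test -e command, as shown in the sketch below. It assumes a locally configured Hadoop client and is not what the Check node runs internally; the path is a placeholder.

```python
# Sketch of an HDFS existence check via the Hadoop CLI. `hadoop fs -test -e`
# exits with 0 if the path exists. Assumes a configured Hadoop client.
import subprocess

def hdfs_path_exists(path):
    result = subprocess.run(["hadoop", "fs", "-test", "-e", path])
    return result.returncode == 0

print(hdfs_path_exists("/user/dw_test/dw"))
```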

Configure a check policy for an OSS-HDFS object

The following table describes the parameters.

Data Source Type

Select OSS_HDFS.

Data Source Name

The name of the data source to which the OSS-HDFS object that you want to check belongs.

If no data source is available, click New Data Source to add a data source. For more information about how to add an OSS-HDFS data source, see OSS-HDFS data source.

File Path

The storage path of the OSS-HDFS object that you want to check. You can perform the following operations to view the storage path of an OSS-HDFS object: Log on to the Object Storage Service (OSS) console. Go to the details page of a desired bucket. In the left-side navigation pane of the details page, choose Object Management > Objects > HDFS Files.

The path must conform to the object path format of OSS-HDFS:

  • If the path ends with a /, the Check node checks whether a folder with the same name exists in OSS-HDFS.

    Example: If the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a /, the Check node checks whether a file with the same name exists in OSS-HDFS.

    Example: If the path is user, the Check node checks whether a file named user exists.

Condition For Check Passing

Specifies the check passing condition of the OSS-HDFS object.

  • If the OSS-HDFS object exists, the check is passed, and the system considers that the OSS-HDFS object is available.

  • If the OSS-HDFS object does not exist, the check is not passed, and the system considers that the OSS-HDFS object is unavailable.

Policy For Stopping Check

The policy for stopping the check task on the current Check node. The options and limits are the same as those of the Check Stopping Policy parameter described in Configure a check policy for a MaxCompute partitioned table: specify an end time for the check (Time for Stopping Check) or a maximum number of checks (Checks Allowed Before Check Node Stops), together with a check interval of 1 to 30 minutes.

Real-time synchronization task


The following table describes the parameters.

Check Object

Select Real-time Synchronization Task.

Real-time Synchronization Task

The name of the real-time synchronization task that you want to check.

Note
  • Only real-time synchronization tasks that are used to synchronize data from Kafka to MaxCompute are supported.

  • If such a real-time synchronization task exists but you cannot select it, check whether the task has been deployed to the production environment. If it has not, deploy it first.

Policy For Stopping Check

The policy for stopping the check task on the current Check node. The options and limits are the same as those of the Check Stopping Policy parameter described in Configure a check policy for a MaxCompute partitioned table: specify an end time for the check (Time for Stopping Check) or a maximum number of checks (Checks Allowed Before Check Node Stops), together with a check interval of 1 to 30 minutes.

Step 3: Configure scheduling properties for the Check node

If you want a Check node to periodically check the target object, click Properties on the right side of the node configuration page and configure scheduling properties for the node based on your business requirements. For more information, see Overview of task scheduling property configuration.

You must configure scheduling properties such as scheduling dependencies and scheduling time for a Check node in the way that you configure scheduling properties for other types of nodes. Each node in DataWorks must have an upstream dependency. If a Check node does not have an actual upstream dependency, you can configure it to depend on a zero load node or the workspace root node based on the complexity of your business. For more information, see Create and use a zero load node.

Note

You must configure the Rerun and Parent Nodes parameters for the node before you can commit the node.

Step 4: Commit and deploy a task on the node

After you configure the node, commit and deploy it. The system will then periodically run the related task based on the scheduling configurations.

  1. Click the Save icon in the top toolbar to save the node.

  2. Click the Submit icon in the top toolbar to commit the task on the node.

    In the Submit dialog box, enter a Change Description and, based on your business requirements, select whether to perform code review and smoke testing after the node is committed.

    Note
    • You must configure the Rerun Property and Parent Nodes before you can commit a node.

    • You can use the code review feature to ensure the code quality of nodes and prevent node execution errors caused by invalid node code. If a code review is performed, the committed node code must be approved by reviewers before it can be deployed. For more information, see Code review.

    • To ensure that a task on a scheduling node runs as expected, we recommend that you perform smoke testing on the task before deployment. For more information, see Perform smoke testing.

If you use a workspace in standard mode, after you commit the task, you must also click Deploy in the upper-right corner of the node configuration tab to deploy the task to the production environment. For more information, see Deploy tasks.

What to do next

After a Check node is committed and deployed to Operation Center, it periodically runs checks based on its configurations. You can view the check results and perform related O&M operations in Operation Center of DataWorks. For more information, see Basic O&M operations for periodic tasks.