Use Check nodes to verify data source or real-time synchronization task availability - DataWorks

The DataWorks Check node verifies the availability of a target object, such as a MaxCompute partitioned table, an FTP file, an OSS file, an HDFS file, an OSS-HDFS file, or a real-time synchronization task. If a task depends on such an object, create a Check node for the object and configure the task as a downstream node of the Check node. When the check policy is met, the Check node succeeds and triggers the downstream task. This topic describes the objects a Check node can check, the available policies, and how to configure the node.

Supported objects and check policies

A Check node can check only data sources and real-time synchronization tasks. The check policies are as follows:

Data source
- MaxCompute partitioned table or DLF (Paimon partitioned table)
  Note
  Check nodes support MaxCompute partitioned tables but do not support MaxCompute non-partitioned tables.
  Check nodes provide the following two check policies to determine whether the data in a MaxCompute partitioned table is ready.
  - Policy 1: Check if the target partition exists
    If the target partition exists, the Check node considers data generation complete and the data available.
  - Policy 2: Check if the target partition has been updated within a specified period
    If the target partition has not been updated within the specified period, the Check node considers data generation complete and the data available.
- FTP, OSS, HDFS, or OSS-HDFS file
  If the target file exists, the Check node considers it available.
Real-time synchronization task
The check is based on the scheduled start time of the Check node. If the real-time synchronization task has finished writing data by that time, the check passes.
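
The two partition policies above can be read as simple predicates. The following Python sketch is illustrative only: the function names are hypothetical, the inputs stand in for metadata that DataWorks reads itself, and treating a partition aged exactly the window length as "not updated" is an assumption. The allowed window values come from the parameter description in Step 2.

```python
from datetime import datetime, timedelta
from typing import Optional

# Windows that can be selected for the LastModifiedTime policy (minutes).
ALLOWED_WINDOWS_MIN = (5, 10, 15, 20, 25, 30)


def passes_partition_exists(partition_exists: bool) -> bool:
    """Policy 1: the check passes as soon as the target partition exists."""
    return partition_exists


def passes_last_modified(last_modified: Optional[datetime],
                         now: datetime,
                         window_min: int) -> bool:
    """Policy 2: pass when the partition has NOT been updated in the window."""
    if window_min not in ALLOWED_WINDOWS_MIN:
        raise ValueError(f"window must be one of {ALLOWED_WINDOWS_MIN}")
    if last_modified is None:
        # Assumption: a partition that does not exist yet cannot pass.
        return False
    # Assumption: an age equal to the window counts as "no update".
    return now - last_modified >= timedelta(minutes=window_min)
```

Note that Policy 2 passes on the absence of recent writes, so a partition that is still being written keeps the check from passing.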

In addition, you must specify the check interval (the time between consecutive checks) and a stop condition (the maximum number of checks or a check deadline). If the check has still not passed when the maximum number of checks is reached or the check deadline arrives, the Check node fails. For more information about how to configure these policies, see Step 2: Configure check policies.

Note

A Check node periodically checks a target object. You must configure the scheduling time for the Check node based on the expected start time of the check. After the scheduling conditions are met, the Check node remains in a running state until the check passes or fails based on the stop policy. For more information about schedule settings, see Step 3: Configure task scheduling.
A Check node occupies scheduling resources until the check completes.
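
The run-until-pass-or-stop behavior described above resembles a polling loop. The sketch below is a minimal model, not the DataWorks implementation: `run_check` and its parameters are hypothetical names, and stopping when the next check would fall after the deadline is an assumption.

```python
import time
from typing import Callable, Optional


def run_check(check: Callable[[], bool],
              interval_s: float,
              max_checks: Optional[int] = None,
              deadline_s: Optional[float] = None,
              sleep: Callable[[float], None] = time.sleep,
              clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it passes or a stop condition is hit.

    Returns True if the check passed, False if the node would fail.
    """
    if max_checks is None and deadline_s is None:
        raise ValueError("set max_checks or deadline_s")
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        if check():
            return True
        # Stop condition A: the maximum number of checks is reached.
        if max_checks is not None and attempts >= max_checks:
            return False
        # Stop condition B (assumed): the next check would miss the deadline.
        if deadline_s is not None and clock() - start + interval_s > deadline_s:
            return False
        sleep(interval_s)
```

The loop blocks between attempts, which mirrors the point that a Check node occupies scheduling resources until the check completes.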

Limitations

Resource group restrictions: Check node tasks can run only on a serverless resource group. For more information about how to purchase and use a serverless resource group, see Use a serverless resource group.
Data source restrictions: FTP data sources with the Protocol set to SFTP and authenticated by Key are not supported. For more information, see Create an FTP data source.
Node feature restrictions
- A Check node can check only one object. If your task depends on multiple objects (for example, multiple MaxCompute partitioned tables), you must create multiple Check nodes to verify each object separately.
- The check interval of a Check node ranges from 1 minute to 30 minutes.
DataWorks edition restrictions: Check nodes are supported only in DataWorks Professional Edition and higher editions. If you are using a lower edition, see Version upgrade description to upgrade.
Supported regions: Check nodes are available in the following regions: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), UK (London), US (Silicon Valley), and US (Virginia).

Prerequisites

When a Check node verifies a data source, you must create the corresponding data source before using the Check node. The following table lists the required preparations.

Check object type	Preparation	Reference
MaxCompute partitioned table	A MaxCompute compute resource has been created and associated with Data Studio. When you create and associate a MaxCompute compute resource in DataWorks, a MaxCompute data source is automatically created. A MaxCompute partitioned table has been created.	Associate a MaxCompute compute resource Preparations before data development: Associate a data source or a cluster with DataStudio Create and use MaxCompute tables
FTP file	An FTP data source has been created. In DataWorks, you must create an FTP service as an FTP data source before you can access the data in the FTP service through the data source.	Create an FTP data source
OSS file	An OSS data source has been created with the access mode set to Access Key. In DataWorks, you must create an OSS bucket as an OSS data source before you can access the data in the bucket through the data source. Note Currently, only the Access Key mode is supported for accessing OSS data sources in a Check node. OSS data sources configured with the RAM role authorization mode cannot be used with Check nodes.	Create a bucket Create an OSS data source
HDFS file	An HDFS data source has been created. In DataWorks, you must add your HDFS file system as an HDFS data source before you can access the HDFS file data through the data source.	Create an HDFS data source
OSS-HDFS file	An OSS-HDFS data source has been created. In DataWorks, you must create an OSS-HDFS service as an OSS-HDFS data source before you can access the data in the OSS-HDFS service through the data source.	OSS-HDFS

When a Check node verifies a real-time synchronization task, only real-time synchronization tasks from Kafka to MaxCompute are supported. Before you use a Check node, create the corresponding real-time synchronization task. For more information, see Create a real-time synchronization task.

Step 1: Create a Check node

Log on to the DataWorks console. In the target region, click Data Development and O&M > Data Development in the left-side navigation pane. Select a workspace from the drop-down list and click Go to Data Development.
Click the icon and choose Create Node > General > Check Node.
Follow the on-screen instructions to specify the path, name, and other information for the node.

Step 2: Configure check policies

You can configure the Check node to check a data source or a real-time synchronization task based on your business requirements and configure the corresponding policies.

Data source

Configure check policies for a MaxCompute partitioned table

The following table describes the parameters.

Parameter	Description
Data Source Type	Select MaxCompute.
Data Source Name	The data source where the MaxCompute partitioned table to be checked resides. If no data source is available, click New data source to create one. For more information about how to create a MaxCompute data source, see Create a MaxCompute data source.
Table Name	The MaxCompute partitioned table to be checked. Note Only MaxCompute partitioned tables within the selected data source can be selected.
Partition	The partition of the MaxCompute table to be checked. After you configure the Table Name parameter, you can preview the table information to view partition names. You can also use scheduling parameters to obtain partition names. For more information about how to use scheduling parameters, see Configure scheduling parameters.
Condition for Check Passing	Defines the check method and pass condition for the partitioned table. You can use one of the following two methods: Partition exists: Check whether the target partition exists. Exists: The check passes, and the partitioned table is considered available. Does not exist: The check does not pass, and the partitioned table is considered unavailable. Verify based on LastModifiedTime: Check whether the target partition data has been updated within a specified period. No update: The check passes, and the partition data is considered fully written. The partitioned table is available. Updated: The check does not pass, and the partition data is considered not fully written. The partitioned table is unavailable. Note You can only check whether the partition data has been updated within 5, 10, 15, 20, 25, or 30 minutes. For more information about LastModifiedTime, see LastModifiedTime.
Policy for Stopping Check	Configures the stop policy for the Check node task. You can set a stop time or a stop count and configure the check frequency: Set stop time: Specify the duration and the check interval (the time between consecutive checks). If the duration expires and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. If an upstream task is delayed and the Check node task actually starts running after the configured check deadline, the Check node task still starts but performs only one check. Set stop count: Specify the maximum number of checks and the check interval (the time between consecutive checks). If the maximum number of checks is reached and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. The maximum duration of a Check node task is 24 hours (1,440 minutes). The maximum number of checks depends on the check interval. For example, if the check interval is 5 minutes, the maximum number of checks is 288. If the check interval is 10 minutes, the maximum number of checks is 144. The actual values are displayed on the configuration page.
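
The relationship between the check interval and the maximum number of checks follows from the 24-hour limit. The sketch below reproduces the two documented examples. Rounding down for intervals that do not divide 1,440 evenly is an assumption, and `max_check_count` is a hypothetical helper; the configuration page remains the authority for the actual values.

```python
MAX_RUNTIME_MIN = 24 * 60  # 1,440 minutes: maximum duration of a Check node task


def max_check_count(interval_min: int) -> int:
    """Upper bound on the number of checks for a given check interval."""
    if not 1 <= interval_min <= 30:
        raise ValueError("check interval must be between 1 and 30 minutes")
    # Assumption: intervals that do not divide 1,440 evenly round down.
    return MAX_RUNTIME_MIN // interval_min


print(max_check_count(5))   # 288
print(max_check_count(10))  # 144
```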

Configure check policies for an FTP file

The following table describes the parameters.

Parameter	Description
Data Source Type	Select FTP.
Data Source Name	The data source where the FTP file to be checked resides. If no data source is available, click New data source to create one. For more information about how to create an FTP data source, see Create an FTP data source.
Object Path	The path of the FTP file to be checked. Example: `/var/ftp/test/`. If the specified path exists, the file is considered to exist. You can enter the path directly or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Configure scheduling parameters.
Condition for Check Passing	Defines the pass condition for the FTP file check. If the FTP file exists, the check passes and the FTP file is considered available. If the FTP file does not exist, the check does not pass and the FTP file is considered unavailable.
Policy for Stopping Check	Configures the stop policy for the Check node task. You can set a stop time or a stop count and configure the check frequency: Set stop time: Specify the duration and the check interval (the time between consecutive checks). If the duration expires and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. If an upstream task is delayed and the Check node task actually starts running after the configured check deadline, the Check node task still starts but performs only one check. Set stop count: Specify the maximum number of checks and the check interval (the time between consecutive checks). If the maximum number of checks is reached and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. The maximum duration of a Check node task is 24 hours (1,440 minutes). The maximum number of checks depends on the check interval. For example, if the check interval is 5 minutes, the maximum number of checks is 288. If the check interval is 10 minutes, the maximum number of checks is 144. The actual values are displayed on the configuration page.

Configure check policies for an OSS file

Set Check Object to Data Source and Check Pass Condition to File Exists. For Check Stop Policy, select Check Stop Time or Check Stop Count, then set the check interval (for example, every 5 minutes) and the deadline or count. Note: If an upstream task is delayed and the Check task actually starts running after the configured check deadline, the Check task still runs but performs only one check.

The following table describes the parameters.

Parameter	Description
Data Source Type	Select OSS.
Data Source Name	The data source where the OSS file to be checked resides. If no data source is available, click New data source to create one. For more information about how to create an OSS data source, see Create an OSS data source.
Object Path	The path of the OSS file to be checked. You can log on to the OSS console, go to the details page of the target bucket, and view the path on the File Management > Files > OSS Object page. Note By default, the system uses the bucket configured in the OSS data source that you selected for Data Source Name. When you enter the path, do not include the `oss://` prefix or the bucket information. We recommend that you do not start the path with a forward slash `/`. The format follows the OSS file path format definition: If the file path ends with `/`, the Check node checks whether a folder with the same name as the input path exists in OSS. Example: `user/`, which checks whether the user folder exists. If the file path does not end with `/`, the Check node checks whether a file with the same name as the input path exists in OSS. Example: `user`, which checks whether the user file exists. Folder check limitations: If you want to check whether a folder exists, note the following scenarios. Direct file upload (such as `put /a/b/1.txt`): The `/a` and `/a/b` folders are not created. Only the object `/a/b/1.txt` is created. The console displays a virtual folder `/a/b/`, but this folder does not actually exist, so a check for it does not pass. Step-by-step path upload (such as `put /a` → `put /a/b` → `put /a/b/1.txt`): The `/a` and `/a/b` folder objects (0 KB in size) are created along with the file object. In this case, the folders exist when checked.
Condition for Check Passing	Defines the pass condition for the OSS file check. If the OSS file exists, the check passes and the OSS file is considered available. If the OSS file does not exist, the check does not pass and the OSS file is considered unavailable.
Policy for Stopping Check	Configures the stop policy for the Check node task. You can set a stop time or a stop count and configure the check frequency: Set stop time: Specify the duration and the check interval (the time between consecutive checks). If the duration expires and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. If an upstream task is delayed and the Check node task actually starts running after the configured check deadline, the Check node task still starts but performs only one check. Set stop count: Specify the maximum number of checks and the check interval (the time between consecutive checks). If the maximum number of checks is reached and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. The maximum duration of a Check node task is 24 hours (1,440 minutes). The maximum number of checks depends on the check interval. For example, if the check interval is 5 minutes, the maximum number of checks is 288. If the check interval is 10 minutes, the maximum number of checks is 144. The actual values are displayed on the configuration page.
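
The Object Path rules for OSS (trailing `/` for folders, no `oss://` prefix, and the folder check limitation) can be illustrated with a small model. This is a sketch under stated assumptions: the helper names are hypothetical, a bucket is modeled as a plain set of object keys, and the existence check is assumed to be an exact key lookup. Keys are written without a leading `/`, in line with the recommendation for the path.

```python
from typing import List, Set


def check_kind(path: str) -> str:
    """A trailing '/' requests a folder check; anything else a file check."""
    return "folder" if path.endswith("/") else "file"


def lint_object_path(path: str) -> List[str]:
    """Collect problems with an Object Path value before it is entered."""
    problems = []
    if not path:
        problems.append("path is empty")
    elif path.startswith("oss://"):
        problems.append("remove the oss:// prefix and the bucket name")
    elif path.startswith("/"):
        problems.append("a leading '/' is not recommended")
    return problems


def object_exists(keys: Set[str], path: str) -> bool:
    """Assumed model: the check is an exact lookup of the object key."""
    return path in keys


# Direct upload creates only the file object; no folder objects exist.
direct = {"a/b/1.txt"}
# Step-by-step upload also creates 0 KB folder objects.
stepwise = {"a/", "a/b/", "a/b/1.txt"}

print(object_exists(direct, "a/b/"))    # False: the folder is only virtual
print(object_exists(stepwise, "a/b/"))  # True: a folder object was created
```

In the direct-upload case, checking the file path `a/b/1.txt` instead of the folder path is the check that can pass.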

Configure check policies for an HDFS file

If an upstream task is delayed and the Check task actually starts running after the configured check deadline, the Check task still runs but performs only one check. The following table describes the parameters.

Parameter	Description
Data Source Type	Select HDFS.
Data Source Name	The data source where the HDFS file to be checked resides. If no data source is available, click New data source to create one. For more information about how to create an HDFS data source, see Create an HDFS data source.
Object Path	The path of the HDFS file to be checked. Example: `/user/dw_test/dw`. If the specified path exists, the file is considered to exist. You can enter the path directly or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Configure scheduling parameters.
Condition for Check Passing	Defines the pass condition for the HDFS file check. If the HDFS file exists, the check passes and the HDFS file is considered available. If the HDFS file does not exist, the check does not pass and the HDFS file is considered unavailable.
Policy for Stopping Check	Configures the stop policy for the Check node task. You can set a stop time or a stop count and configure the check frequency: Set stop time: Specify the duration and the check interval (the time between consecutive checks). If the duration expires and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. If an upstream task is delayed and the Check node task actually starts running after the configured check deadline, the Check node task still starts but performs only one check. Set stop count: Specify the maximum number of checks and the check interval (the time between consecutive checks). If the maximum number of checks is reached and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. The maximum duration of a Check node task is 24 hours (1,440 minutes). The maximum number of checks depends on the check interval. For example, if the check interval is 5 minutes, the maximum number of checks is 288. If the check interval is 10 minutes, the maximum number of checks is 144. The actual values are displayed on the configuration page.

Configure check policies for an OSS-HDFS file

Set Check Object to Data Source. In the Check Stop Policy section, select Check Stop Time or Check Stop Count. By default, the check runs every 5 minutes. If an upstream task is delayed and the Check task actually starts running after the configured check deadline, the Check task still runs but performs only one check. The following table describes the parameters.

Parameter	Description
Data Source Type	Select OSS_HDFS.
Data Source Name	The data source where the OSS-HDFS file to be checked resides. If no data source is available, click New data source to create one. For more information about how to create an OSS-HDFS data source, see Create an OSS-HDFS data source.
Object Path	The path of the OSS-HDFS file to be checked. You can log on to the OSS console, go to the details page of the target bucket, and view the path on the File Management > Files > HDFS Files page. The format follows the OSS-HDFS file path format definition: If the file path ends with `/`, the Check node checks whether a folder with the same name as the input path exists in OSS-HDFS. Example: `user/`, which checks whether the user folder exists. If the file path does not end with `/`, the Check node checks whether a file with the same name as the input path exists in OSS-HDFS. Example: `user`, which checks whether the user file exists.
Condition for Check Passing	Defines the pass condition for the OSS-HDFS file check. If the OSS-HDFS file exists, the check passes and the OSS-HDFS file is considered available. If the OSS-HDFS file does not exist, the check does not pass and the OSS-HDFS file is considered unavailable.
Policy for Stopping Check	Configures the stop policy for the Check node task. You can set a stop time or a stop count and configure the check frequency: Set stop time: Specify the duration and the check interval (the time between consecutive checks). If the duration expires and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. If an upstream task is delayed and the Check node task actually starts running after the configured check deadline, the Check node task still starts but performs only one check. Set stop count: Specify the maximum number of checks and the check interval (the time between consecutive checks). If the maximum number of checks is reached and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. The maximum duration of a Check node task is 24 hours (1,440 minutes). The maximum number of checks depends on the check interval. For example, if the check interval is 5 minutes, the maximum number of checks is 288. If the check interval is 10 minutes, the maximum number of checks is 144. The actual values are displayed on the configuration page.

Real-time synchronization task

The following table describes the parameters.

Parameter	Description
Check Object	Select Real-time Synchronization Task.
Real-time Synchronization Task	The real-time synchronization task to be checked. Note Currently, only real-time synchronization tasks from Kafka to MaxCompute are supported. If a real-time synchronization task already exists but cannot be selected, check whether the task has been deployed to the production environment.
Policy for Stopping Check	Configures the stop policy for the Check node task. You can set a stop time or a stop count and configure the check frequency: Set stop time: Specify the duration and the check interval (the time between consecutive checks). If the duration expires and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. If an upstream task is delayed and the Check node task actually starts running after the configured check deadline, the Check node task still starts but performs only one check. Set stop count: Specify the maximum number of checks and the check interval (the time between consecutive checks). If the maximum number of checks is reached and the Check task has not passed, the task automatically exits with a failure status. Note The check interval ranges from 1 to 30 minutes. The maximum duration of a Check node task is 24 hours (1,440 minutes). The maximum number of checks depends on the check interval. For example, if the check interval is 5 minutes, the maximum number of checks is 288. If the check interval is 10 minutes, the maximum number of checks is 144. The actual values are displayed on the configuration page.
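
The stop-time note about delayed upstream tasks applies to every check object type. The following sketch lists the moments at which checks would occur for a given actual start time and deadline. It is a simplified model: `planned_check_times` is a hypothetical helper, and the assumption that a check landing exactly on the deadline still runs is not stated in the source.

```python
from datetime import datetime, timedelta
from typing import List


def planned_check_times(actual_start: datetime,
                        deadline: datetime,
                        interval_min: int) -> List[datetime]:
    """List the times at which checks would run in stop-time mode."""
    if not 1 <= interval_min <= 30:
        raise ValueError("check interval must be between 1 and 30 minutes")
    # The first check always runs, even if the node starts after the deadline.
    times = [actual_start]
    step = timedelta(minutes=interval_min)
    next_time = actual_start + step
    # Assumption: a check that lands exactly on the deadline still runs.
    while next_time <= deadline:
        times.append(next_time)
        next_time += step
    return times


deadline = datetime(2024, 1, 1, 9, 20)
on_time = planned_check_times(datetime(2024, 1, 1, 9, 0), deadline, 5)
late = planned_check_times(datetime(2024, 1, 1, 9, 30), deadline, 5)
print(len(on_time))  # 5
print(len(late))     # 1: a late start performs a single check
```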

Step 3: Configure task scheduling

To have a Check node check its target object periodically, click Scheduling on the right side of the node editing page and configure the schedule settings based on your business requirements. For more information, see Configure schedule settings.

Like a regular scheduled node, a Check node requires schedule settings such as dependencies and scheduling time. Every node in DataWorks must have an upstream dependency. If the Check node has no actual upstream dependency, you can choose to depend on a virtual node or the workspace root node based on the complexity of your workspace. For more information, see Configure dependencies.

Note

You must configure the Rerun attribute and Parent Nodes for the node before you can submit it.

Step 4: Submit and deploy the task

After you configure the node task, you must submit and deploy it. Once submitted and deployed, the node runs periodically based on the schedule settings.

Click the Save icon on the toolbar to save the node.
Click the Submit icon on the toolbar to submit the node task.
When you submit the task, enter the Change Description in the Submission dialog and choose whether to perform code review and smoke testing after the node is submitted.
Note
- You must configure the Rerun attribute and Parent Nodes for the node before you can submit it.
- Code review helps you control the quality of task code and prevents errors caused by deploying faulty code without review. If code review is enabled, the submitted node code must be approved by a reviewer before it can be deployed. For more information, see Configure code review.
- To ensure that the scheduled node task runs as expected, we recommend that you perform smoke testing before deployment. For more information, see Smoke testing.

If you use a standard mode workspace, after the task is submitted, click Deploy in the upper-right corner of the node editing page to deploy the task to the production environment. For more information, see Deploy tasks.

Next step

After the Check node is submitted and deployed to Operation Center, it periodically runs checks based on the node configuration. You can view check results and perform related operations in Operation Center. For more information, see View and manage Check node instances.