
DataWorks:Configure a Check node

Last Updated:Mar 25, 2024

DataWorks allows you to use a Check node to check the availability of MaxCompute partitioned tables, FTP files, and Object Storage Service (OSS) objects based on check policies. If the condition that is specified in the check policy of a Check node is met, the task on the Check node is successfully run. Therefore, if the running of a task depends on an object, you can use a Check node to check the availability of the object and configure the task as a descendant task of the Check node. The descendant task is triggered to run only after the task on the Check node is successfully run. This topic describes the supported check objects and check policies and how to configure a Check node.

Supported check objects and check policies

You can use Check nodes to check only MaxCompute partitioned tables, FTP files, and OSS objects. The following information describes check policies for each type of check object:

  • MaxCompute partitioned tables

    A Check node provides the following policies to check the data availability of a MaxCompute partitioned table:

    • Policy 1: Check whether a specified partition exists.

      If the partition exists, the system considers that the MaxCompute partitioned table is available.

    • Policy 2: Check whether data in a specified partition is updated within a specified period of time.

      If the data in the partition is not updated within the specified period of time, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

  • FTP files or OSS objects

    If a specified FTP file or OSS object exists, the system considers that the FTP file or OSS object is available.

In addition, you must specify an interval at which a check is triggered and a condition for stopping a check task on the Check node. The condition can be the point in time at which a check is stopped or the maximum number of checks. If the check is still not passed after the specified time elapses or the maximum number of checks is reached, the check task exits and enters the failed state. For more information about policy configuration, see the Step 2: Configure a check policy for the Check node section in this topic.
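
The stop conditions described above can be sketched as a simple polling loop. This is only an illustration under assumed names (`run_check`, `condition`), not the actual DataWorks scheduler logic:

```python
import time

def run_check(condition, interval_seconds, max_checks=None, stop_time=None):
    """Poll `condition` until it passes, the stop time passes, or the
    maximum number of checks is reached. Returns True if the check passes."""
    checks = 0
    while True:
        if stop_time is not None and time.time() > stop_time:
            return False  # stop time reached: the check task fails
        if max_checks is not None and checks >= max_checks:
            return False  # maximum number of checks reached: the check task fails
        if condition():
            return True   # check passed: descendant tasks can be triggered
        checks += 1
        time.sleep(interval_seconds)
```

Either stop condition (or both) can be supplied; if neither is reached and the condition never passes, the loop keeps polling at the configured interval.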

Note

MaxCompute partitioned tables, FTP files, or OSS objects can be periodically checked by using Check nodes. You must configure the scheduling time to run a task on a Check node based on the expected check start time. The Check node remains in the running state after the conditions for node scheduling are met. If the condition in the check policy is met, the task on the Check node is successfully run. If the check is not passed for a long period of time, the task on the Check node fails. For more information about scheduling configuration, see the Step 3: Configure scheduling properties for the Check node section in this topic.

Limits

  • You cannot use the shared resource group for scheduling to run tasks on Check nodes.

  • If you purchased an exclusive resource group for scheduling before November 1, 2023, you must contact technical support to upgrade the configurations of the resource group before you can use the resource group to run tasks on Check nodes. If you use a resource group whose configurations are not upgraded to run tasks on Check nodes, the error message java.lang.RuntimeException: unknown type : 241 appears.

  • A Check node can be used to check only one object. If your task depends on multiple objects, such as multiple MaxCompute partitioned tables, you must create multiple Check nodes to separately check these objects.

Prerequisites

Before a Check node can be used to check the availability of an object, a corresponding data source must be added.

  • MaxCompute partitioned tables

    1. A MaxCompute data source is added to DataWorks and is associated with DataStudio. You must add a MaxCompute project as a MaxCompute data source in DataWorks before you can use the data source to access data in the MaxCompute project. For more information, see Add a MaxCompute data source and Preparations before data development: Associate a data source or a cluster with DataStudio.

    2. A MaxCompute partitioned table is created. For more information, see Create and manage MaxCompute tables.

  • FTP files: An FTP data source is added. You must add the FTP service as an FTP data source in DataWorks before you can use the data source to access data of the FTP service. For more information, see FTP data source.

  • OSS objects: An OSS data source is added and an AccessKey pair is configured to access the OSS data source. You must add an OSS bucket as an OSS data source in DataWorks before you can use the data source to access data in the bucket. For more information, see Create a bucket and OSS data source.

Note

You can access an OSS data source only by using an AccessKey pair in a Check node. You cannot use an OSS data source that is added in RAM role-based authorization mode in a Check node.

Step 1: Create a Check node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

  2. Move the pointer over the Create icon and choose Create Node > General > Check Node.

    In the Create Node dialog box, configure the Path and Name parameters as prompted and click Confirm.

Step 2: Configure a check policy for the Check node

You can use Check nodes to check MaxCompute partitioned tables, FTP files, or OSS objects based on your business requirements. For different check objects, you must configure different check policies.

Configure a check policy for a MaxCompute partitioned table


The following table describes the parameters.


Data Source Type

Select MaxCompute.

Data Source Name

The name of the data source to which the MaxCompute partitioned table that you want to check belongs.

If no data source is available, click New data source to add a data source. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.

Table Name

The name of the MaxCompute partitioned table that you want to check.

Note

You can select only a MaxCompute partitioned table that belongs to the specified data source.

Partition

The name of the partition in the MaxCompute partitioned table that you want to check.

You can click Preview Table Information next to Table Name to obtain the partition name. You can also use scheduling parameters to obtain the partition name. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.
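
For example, if the table is partitioned by business date, the partition can be expressed with a scheduling parameter. The value below is illustrative (it assumes a partition column named pt and the commonly used $bizdate parameter; see the linked topic for the supported formats):

```text
pt=$bizdate
```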

Condition For Check Passing

Specifies the check method and check passing condition of the partitioned table. Valid values:

  • Partition Existed: checks whether the specified partition exists.

    • If the partition exists, the partitioned table passes the check, and the system considers that the partitioned table is available.

    • If the partition does not exist, the partitioned table fails the check, and the system considers that the partitioned table is unavailable.

  • Last Modification Time Not Updated for Specific Duration: checks whether data in the specified partition is updated within a specified period of time. This check is based on the LastModifiedTime property of the partition.

    • If data in the partition is not updated within the specified period of time, the partitioned table passes the check, and the system considers that the data write operation is complete and the partitioned table is available.

    • If data in the partition is updated within the specified period of time, the partitioned table fails the check, and the system considers that the data write operation is not complete and the partitioned table is unavailable.

    Note
    • You can check whether partition data is updated within 5, 10, 15, 20, 25, or 30 minutes.

    • For more information about the LastModifiedTime parameter, see Change the value of LastModifiedTime.
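
The quiet-period condition above can be expressed as a small helper. This is a hypothetical sketch of the rule, not the DataWorks implementation:

```python
from datetime import datetime, timedelta

def partition_is_stable(last_modified: datetime, quiet_minutes: int,
                        now: datetime) -> bool:
    """Return True if the partition has not been updated within the last
    `quiet_minutes` minutes, i.e. the write is assumed to be complete."""
    return now - last_modified >= timedelta(minutes=quiet_minutes)
```

For example, with a 10-minute quiet period, a partition whose LastModifiedTime is 15 minutes in the past passes the check, while one modified 5 minutes ago does not.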

Policy For Stopping Check

The policy for stopping a check task on the current Check node. DataWorks allows you to specify the point in time at which a check is stopped or the maximum number of checks. You can also specify the check frequency.

  • Time for Stopping Check: You can set the check interval to 5, 10, 15, 20, 25, or 30 minutes and specify an end time for the check task. If the check is still not passed after the specified time elapses, the check task exits and enters the failed state.

    Note

    If the time when a check task starts to run is later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: You can set the check interval to 5, 10, 15, 20, 25, or 30 minutes and specify the maximum number of checks. If the number of times that a check is performed exceeds the specified number and the check is still not passed, the check task exits and enters the failed state.

    Note

    The maximum running duration of a check task is 24 hours. The maximum number of checks varies based on the selected check interval. For example, if you set the check interval to 5 minutes, a check can be performed a maximum of 288 times; if you set the check interval to 10 minutes, a check can be performed a maximum of 144 times. You can view the parameter settings in the DataWorks console.
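
The figures in the note above follow from dividing the 24-hour cap by the check interval, as this hypothetical helper shows:

```python
def max_checks_for_interval(interval_minutes: int) -> int:
    """Maximum number of checks that fit in the 24-hour running-duration cap."""
    return 24 * 60 // interval_minutes
```

For a 5-minute interval this yields 288 checks; for a 10-minute interval, 144.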

Configure a check policy for an FTP file


The following table describes the parameters.


Data Source Type

Select FTP.

Data Source Name

The name of the data source to which the FTP file that you want to check belongs.

If no data source is available, click New data source to add a data source. For more information about how to add an FTP data source, see FTP data source.

File Path

The storage path of the FTP file that you want to check. Example: /var/ftp/test/.

If the specified path exists on the FTP server, the system considers that the file exists.

You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

Condition For Check Passing

Specifies the check passing condition of the FTP file.

  • If the FTP file exists, the check is passed, and the system considers that the FTP file is available.

  • If the FTP file does not exist, the check is not passed, and the system considers that the FTP file is unavailable.
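
To illustrate what such an existence check involves, the following sketch lists the parent directory of the path over FTP. The helper name is hypothetical and only the standard-library ftplib interface is assumed; DataWorks performs the actual check internally:

```python
import posixpath

def ftp_path_exists(ftp, path):
    """Check whether `path` exists on the FTP server by listing its parent
    directory. `ftp` is an ftplib.FTP-like object; only its nlst() method
    is used, so in production you would pass a connected ftplib.FTP."""
    parent, name = posixpath.split(path.rstrip("/"))
    try:
        entries = ftp.nlst(parent or "/")
    except Exception:
        return False  # parent directory missing or not listable
    # Servers may return bare names or full paths; accept either form.
    return any(posixpath.basename(entry) == name for entry in entries)
```
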

Policy For Stopping Check

The policy for stopping a check task on the current Check node. DataWorks allows you to specify the point in time at which a check is stopped or the maximum number of checks. You can also specify the check frequency.

  • Time for Stopping Check: You can set the check interval to 5, 10, 15, 20, 25, or 30 minutes and specify an end time for the check task. If the check is still not passed after the specified time elapses, the check task exits and enters the failed state.

    Note

    If the time when a check task starts to run is later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: You can set the check interval to 5, 10, 15, 20, 25, or 30 minutes and specify the maximum number of checks. If the number of times that a check is performed exceeds the specified number and the check is still not passed, the check task exits and enters the failed state.

    Note

    The maximum running duration of a check task is 24 hours. The maximum number of checks varies based on the selected check interval. For example, if you set the check interval to 5 minutes, a check can be performed a maximum of 288 times; if you set the check interval to 10 minutes, a check can be performed a maximum of 144 times. You can view the parameter settings in the DataWorks console.

Configure a check policy for an OSS object


The following table describes the parameters.


Data Source Type

Select OSS.

Data Source Name

The name of the data source to which the OSS object that you want to check belongs.

If no data source is available, click New data source to add a data source. For more information about how to add an OSS data source, see OSS data source.

File Path

The storage path of the OSS object that you want to check.

The path must conform to the object path format of OSS.

  • If the path ends with a forward slash (/), the Check node checks whether a folder with the same name exists in OSS.

    Example: If the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a forward slash (/), the Check node checks whether an object with the same name exists in OSS.

    Example: If the path is user, the Check node checks whether an object named user exists.

Note

After you select a data source, the system automatically uses the bucket that is configured in the data source. You do not need to specify bucket information in the path. After you enter a path, you can click View Complete Path Information to view the endpoint and bucket information of the OSS data source in the development environment.
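
The trailing-slash rule above can be sketched as follows. The `object_exists` and `list_prefix` callables stand in for an OSS client; with the oss2 SDK they would roughly correspond to `bucket.object_exists(key)` and listing objects by prefix, which is an assumption about the mapping rather than part of this topic:

```python
def check_oss_path(path, object_exists, list_prefix):
    """Apply the trailing-slash rule: a path ending in "/" is treated as a
    folder check, any other path as an exact object-key check."""
    if path.endswith("/"):
        # Folder check: the "folder" exists if any object key has this prefix.
        return len(list_prefix(path)) > 0
    # Object check: look for an object with exactly this key.
    return object_exists(path)
```

For example, the path user/ passes if any object key starts with user/, while the path user passes only if an object with the exact key user exists.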

Condition For Check Passing

Specifies the check passing condition of the OSS object.

  • If the OSS object exists, the check is passed, and the system considers that the OSS object is available.

  • If the OSS object does not exist, the check is not passed, and the system considers that the OSS object is unavailable.

Policy For Stopping Check

The policy for stopping a check task on the current Check node. DataWorks allows you to specify the point in time at which a check is stopped or the maximum number of checks. You can also specify the check frequency.

  • Time for Stopping Check: You can set the check interval to 5, 10, 15, 20, 25, or 30 minutes and specify an end time for the check task. If the check is still not passed after the specified time elapses, the check task exits and enters the failed state.

    Note

    If the time when a check task starts to run is later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: You can set the check interval to 5, 10, 15, 20, 25, or 30 minutes and specify the maximum number of checks. If the number of times that a check is performed exceeds the specified number and the check is still not passed, the check task exits and enters the failed state.

    Note

    The maximum running duration of a check task is 24 hours. The maximum number of checks varies based on the selected check interval. For example, if you set the check interval to 5 minutes, a check can be performed a maximum of 288 times; if you set the check interval to 10 minutes, a check can be performed a maximum of 144 times. You can view the parameter settings in the DataWorks console.

Step 3: Configure scheduling properties for the Check node

If you want the system to periodically run a task on the Check node to check partition data, you can click Properties in the right-side navigation pane on the configuration tab of the Check node to configure properties for the node based on your business requirements. For more information, see Overview.

You must configure scheduling properties such as scheduling dependencies and scheduling time for a Check node in the way that you configure scheduling properties for other types of nodes. Each node in DataWorks must be configured with upstream dependencies. If the Check node does not have ancestor nodes, you can select a zero load node or the root node in the current workspace as the ancestor node of the Check node based on the complexity of your business. For more information, see Create and use a zero load node.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

Step 4: Commit and deploy a task on the node

After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.

  1. Click the Save icon in the top toolbar to save the node.

  2. Click the Submit icon in the top toolbar to commit a task on the node.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code and perform smoke testing after you commit the task based on your business requirements.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

    • To ensure that a task on the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the task. For more information, see Perform smoke testing.

If the workspace that you use is in standard mode, you must click Deploy in the upper-right corner of the node configuration tab to deploy a task on the node to the production environment for running after you commit the task on the node. For more information, see Deploy tasks.

What to do next

After you commit and deploy a task on the Check node to Operation Center in the production environment, DataWorks runs the task on the Check node on a regular basis based on the scheduling configurations of the node. You can view the check results of the node and perform O&M operations in Operation Center. For more information, see Perform basic O&M operations on auto triggered tasks.