In most cases, nodes and workflows in the Workspace Directories section of the DATA STUDIO pane in the DataWorks console need to be periodically scheduled. To enable the system to periodically schedule a node or workflow, you must configure scheduling properties for the node or workflow on the Properties tab. The scheduling properties include the scheduling cycle, scheduling dependencies, and scheduling parameters. This topic provides an overview of the configuration of scheduling properties.
Prerequisites
A node is created. Data development in DataWorks is based on nodes. Tasks of different types of compute engines are encapsulated into different types of nodes in DataWorks. You can select a specific type of node for data development based on your business requirements. For more information, see the topics in the Node development directory.
The Periodic scheduling switch is turned on. A node can be automatically scheduled based on its scheduling properties only if Periodic scheduling is turned on for the workspace to which the node belongs. You can turn on the switch on the Scheduling Settings tab of the Settings page in Data Studio. For more information, see System Settings of Data Studio.
Precautions
Scheduling configurations define the scheduling properties used to run a node. The node can be scheduled based on the scheduling properties only after the node is deployed to the production environment.
The scheduling time specified for a node in Data Studio is the expected running time of an instance that is generated for the node. The actual running time of the instance also depends on the running status of its ancestor instances. For information about the conditions that must be met before a node starts to run, see Use the Intelligent Diagnosis feature.
DataWorks allows you to configure scheduling dependencies between nodes that have different scheduling frequencies. Before you configure scheduling dependencies, we recommend that you read the Principles and samples of scheduling configurations in complex dependency scenarios topic.
In DataWorks, an auto triggered node generates instances based on the scheduling frequency and the number of scheduling cycles of the node. For example, the number of instances generated for a node scheduled by hour every day is the same as the number of scheduling cycles of the node every day. The node is run as an instance.
If you configure scheduling parameters, the input parameters in the code of an auto triggered node in each scheduling cycle are determined by the scheduling time of the node in the specific scheduling cycle and the expressions of the scheduling parameters. For information about the replacement relationship between input parameters in node code and configurations of scheduling parameters, see Supported formats of scheduling parameters.
A workflow contains the Workflow node and other inner nodes, and the dependencies between these nodes are complex. This topic describes only how to configure dependencies and other scheduling settings for a single node. For information about the dependencies between workflows, see Auto triggered workflow.
Go to the Properties tab
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
Go to the Properties tab.
In the Workspace Directories section of the DATA STUDIO pane, find the desired node and click the node name to go to the configuration tab of the node.
In the right-side navigation pane of the configuration tab of the node, click Properties.
Configure scheduling properties for the node
On the Properties tab, you can configure parameters in the Scheduling Parameters, Scheduling Policies, Scheduling Time, Scheduling Dependencies, and Node Output Parameters sections.
(Optional) Scheduling Parameters
If you define a variable when you edit node code, you must assign a value to the variable in the Scheduling Parameters section.
Scheduling parameters are automatically replaced with specific values based on the data timestamps of nodes and value formats of scheduling parameters. This enables dynamic parameter configuration for node scheduling.
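The replacement described above can be pictured with a small Python sketch. It is an illustration only, not DataWorks code: the `${name}` placeholder syntax, the `render_node_code` helper, the use of strftime patterns as value formats, and the rule that the data timestamp is the day before the scheduling time are all assumptions made for this example.

```python
from datetime import datetime, timedelta
from string import Template


def render_node_code(code: str, scheduled_at: datetime, formats: dict[str, str]) -> str:
    """Replace ${name} placeholders with values derived from the data timestamp.

    `formats` maps a parameter name to a strftime pattern, a hypothetical
    stand-in for the value format of a scheduling parameter.
    """
    # Assumption: the data timestamp is one day before the scheduling time.
    data_timestamp = scheduled_at - timedelta(days=1)
    values = {name: data_timestamp.strftime(pattern) for name, pattern in formats.items()}
    return Template(code).substitute(values)


sql = "SELECT * FROM orders WHERE ds = '${bizdate}';"
rendered = render_node_code(sql, datetime(2024, 3, 2, 0, 30), {"bizdate": "%Y%m%d"})
print(rendered)
```

Because the value is computed from the scheduling time of each cycle, the same node code yields a different partition filter in every cycle without being edited.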
Configure parameters in the Scheduling Parameters section
You can use one of the following methods to define scheduling parameters.
Method | Description |
Add a parameter | You can click Add Parameter to configure multiple scheduling parameters for a node. |
Load parameters in node code | After you click Load Parameters in Code, DataWorks identifies the names of the variables defined in the code of the current node and adds the identified variable names to the Parameters section. Note: In most cases, custom variables are defined in a common format. For PyODPS nodes or common Shell nodes, variables are defined by using a different method from that for other types of nodes. For more information about how to define custom variables for different types of nodes, see Configure scheduling parameters for different types of nodes. |
Supported formats of scheduling parameters
For more information about the supported formats of scheduling parameters, see Supported formats of scheduling parameters.
View the configurations of scheduling parameters in Operation Center in the production environment
To prevent unexpected configurations of scheduling parameters from affecting the running of an auto triggered node, we recommend that you check the configurations of the scheduling parameters for the auto triggered node on the Auto Triggered Nodes page in Operation Center in the production environment after the auto triggered node is deployed. For information about how to view an auto triggered node, see View and manage auto triggered nodes.
Scheduling Policies
In the Scheduling Policies section, you can define various information about an auto triggered node, such as the instance generation mode, scheduling type, computing resource, and resource group.
Parameter | Description |
Instance Generation Mode | After an auto triggered node is committed and deployed to the scheduling system in the production environment, DataWorks generates auto triggered node instances for the node. The auto triggered instances are scheduled based on the value of the Instance Generation Mode parameter. Valid values of the Instance Generation Mode parameter:
|
Scheduling Type |
|
Timeout Period | You can configure the Timeout Period parameter to specify a timeout period for a node. If the period of time for which the node is run exceeds the specified timeout period, the node fails. Take note of the following items when you configure this parameter:
|
Support For Rerun | You can configure this parameter to define the rerun property of a node. This parameter is required. Valid values of this parameter:
|
Auto Rerun Upon Failure | If you enable this feature, the scheduling system automatically reruns the node after it fails, based on the configured number of reruns and rerun interval. The scheduling system does not automatically rerun a node that you manually terminate.
|
Computing Resource | The computing resource that you want to use to run the node. You can configure this parameter based on your business requirements. |
Resource Group For Scheduling | The resource group for scheduling that you want to use to run the node. You can configure this parameter based on your business requirements.
|
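The behavior described for Auto Rerun Upon Failure can be sketched as a retry loop. This is a minimal illustration under stated assumptions, not the scheduler's implementation: `run_with_auto_rerun`, `ManuallyTerminated`, `max_reruns`, and `interval_seconds` are hypothetical names introduced here.

```python
import time
from typing import Callable


class ManuallyTerminated(Exception):
    """Hypothetical signal that a user stopped the node by hand."""


def run_with_auto_rerun(task: Callable[[], None], max_reruns: int, interval_seconds: float) -> int:
    """Run `task`, rerunning it after failures; return the number of attempts made."""
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return attempts
        except ManuallyTerminated:
            # A manually terminated node is not rerun automatically.
            raise
        except Exception:
            if attempts > max_reruns:
                raise  # reruns exhausted: the failure is reported
            time.sleep(interval_seconds)


calls = {"n": 0}


def flaky() -> None:
    # Fails on the first two attempts, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")


attempts = run_with_auto_rerun(flaky, max_reruns=3, interval_seconds=0)
print(attempts)
```

The first run does not count as a rerun, so a node with `max_reruns=3` is attempted at most four times in this sketch.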
Scheduling Time
You can configure information such as the scheduling cycle and scheduling time for an auto triggered node in the Scheduling Time section.
If your auto triggered node is stored in a workflow, you must configure the parameters in the Scheduling Time section for the node on the Properties tab of the workflow. If your auto triggered node is not stored in a workflow, you must configure the parameters in the Scheduling Time section for the node on the Properties tab of the node.
Precautions
The scheduling frequency of a node is unrelated to the scheduling frequency of the ancestor node of the node.
The interval at which the node is scheduled is related to the scheduling frequency of the node and is unrelated to the scheduling frequency of the ancestor node of the node.
DataWorks allows you to configure scheduling dependencies between nodes whose scheduling frequencies are different.
DataWorks generates instances for an auto triggered node based on the scheduling frequency and the number of scheduling cycles of the node. For example, the number of instances generated for a node scheduled by hour every day is the same as the number of scheduling cycles of the node every day. The node is run as an instance. In essence, dependencies between auto triggered nodes are dependencies between instances that are generated for the nodes. The number of instances generated for ancestor and descendant auto triggered nodes and dependencies between the instances vary based on the scheduling frequencies of the ancestor and descendant nodes. For information about how to configure scheduling dependencies between nodes whose scheduling frequencies are different, see Select a scheduling dependency type (cross-cycle scheduling dependency).
Dry-run instances are generated for a node on the days when the node is not scheduled to run.
For a node that is not scheduled to run every day, such as a node scheduled by week or month, DataWorks generates dry-run instances for the node on the days when it is not scheduled to run. The dry-run instances return success results when the scheduling time of the node arrives on these days. This way, if a node scheduled by day depends on the node scheduled by week or month, the node scheduled by day can be run as expected. In this case, the node scheduled by week or month is dry run, but the node scheduled by day is run as scheduled.
Time when a node is run.
You can specify the time at which you want a node to be scheduled. The actual time at which the node runs is affected by multiple factors, such as the scheduling time of its ancestor node, the resources required to run the node, and the conditions for running the node. For more information, see What are the conditions that are required for a node to successfully run?
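The dry-run behavior in the precautions above can be illustrated with a short Python sketch for a node scheduled by week. The `instance_kind` function and its weekday numbering are assumptions made for this example; DataWorks does not expose such a function.

```python
from datetime import date


def instance_kind(run_weekdays: set[int], day: date) -> str:
    """Classify the instance that a weekly node gets on `day`.

    `run_weekdays` uses date.weekday() numbering (Monday is 0).
    """
    if day.weekday() in run_weekdays:
        return "run"      # the node code is executed as scheduled
    return "dry-run"      # returns a success result without generating data


weekly_on_monday = {0}
print(instance_kind(weekly_on_monday, date(2024, 3, 4)))  # 2024-03-04 is a Monday
print(instance_kind(weekly_on_monday, date(2024, 3, 5)))  # 2024-03-05 is a Tuesday
```

Because the Tuesday instance still reports success, a descendant node scheduled by day is not blocked on days when the weekly node does no work.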
Configure parameters in the Scheduling Time section
Parameter | Description |
Scheduling Cycle | The scheduling frequency of a node determines how often the node is automatically run. It defines the interval at which the code logic of the node is executed by the scheduling system in the production environment. DataWorks generates instances for the node based on the scheduling frequency and the number of scheduling cycles of the node. For example, the number of instances generated for a node scheduled by hour every day is the same as the number of scheduling cycles of the node every day. The node is run as an instance.
Important: For a node that is scheduled by week, month, or year, DataWorks generates dry-run instances for the node on the days when it is not scheduled to run. The dry-run instances return success results but do not generate data. |
Effective Period | You can specify a validity period during which a node is automatically run as scheduled. The node is not automatically run outside the specified time range. Nodes whose validity period has expired are expired nodes. You can view the number of expired nodes on the O&M Dashboard page of Operation Center and undeploy the nodes based on your requirements. |
Cron Expression | A cron expression is automatically generated based on the configurations of time properties. |
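To make the relationship between scheduling frequency and instance count concrete, the following Python sketch enumerates the scheduling times of one day. The `daily_instance_times` helper and its `start`, `end`, and `interval` inputs are hypothetical; they are not DataWorks parameter names.

```python
from datetime import date, datetime, time, timedelta


def daily_instance_times(day: date, start: time, end: time, interval: timedelta) -> list[datetime]:
    """Return the scheduling time of every instance generated for `day`."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = datetime.combine(day, start)
    last = datetime.combine(day, end)
    times = []
    while current <= last:
        times.append(current)
        current += interval
    return times


# A node scheduled every hour between 00:00 and 23:59 has 24 cycles per day,
# so 24 instances are generated for it.
hourly = daily_instance_times(date(2024, 3, 1), time(0, 0), time(23, 59), timedelta(hours=1))
print(len(hourly))
```

Each entry in the returned list corresponds to one instance, which is the unit that actually runs.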
Scheduling Dependencies
Scheduling dependencies in DataWorks define the relationships between nodes in scheduling scenarios. After you configure scheduling dependencies for a node, the node can start to run only after its ancestor node is successfully run. Scheduling dependencies help ensure that a node can obtain the data that it requires from its ancestor node. If the ancestor node is successfully run, DataWorks determines from the status of the ancestor node that the latest data has been generated, and the node then obtains that data. This prevents the node from failing to obtain the required data.
Precautions
After you configure scheduling dependencies for a node, the node can start to run only after its ancestor node is successfully run. Without this ordering, data quality issues may occur when the node obtains data from its ancestor node.
The time at which a node is run is determined by the scheduling time, which is the expected running time of the node in scheduling scenarios, and the time at which its ancestor node finishes running. This indicates that the actual running time of the node also depends on the scheduling time of its ancestor node. If the ancestor node does not finish running, the node cannot start to run at its scheduling time even if the scheduling time of the node is earlier than that of its ancestor node. For information about the running conditions of a node, see Use the Intelligent Diagnosis feature.
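The timing rule above reduces to taking the latest of two kinds of timestamps. The Python sketch below shows this under a simplifying assumption: it ignores resource waits and other run conditions, and `earliest_start` is a hypothetical helper.

```python
from datetime import datetime


def earliest_start(scheduled_at: datetime, ancestor_finish_times: list[datetime]) -> datetime:
    """An instance starts no earlier than its scheduling time and the finish of every ancestor."""
    return max([scheduled_at, *ancestor_finish_times])


node_scheduled = datetime(2024, 3, 1, 1, 0)      # the node expects to run at 01:00
ancestor_finished = datetime(2024, 3, 1, 2, 15)  # its ancestor finishes at 02:15
print(earliest_start(node_scheduled, [ancestor_finished]))
```

In this example the node cannot start before 02:15, even though its own scheduling time is 01:00.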
Configure parameters in the Scheduling Dependencies section
Scheduling dependencies between nodes in DataWorks ensure that descendant nodes can obtain valid data from ancestor nodes. In essence, scheduling dependencies between nodes reflect the data lineage between the tables of ancestor and descendant nodes. You can determine whether to configure scheduling dependencies for a node based on the lineage between the tables generated by the node and its ancestor node and on your business requirements. The following figure shows the procedure for configuring scheduling dependencies for a node.
If you configure scheduling dependencies for a node based on the table data lineage, the system determines that a strong lineage exists between the table data. This indicates that the table data that is generated by the node depends on the table data that is generated by its ancestor nodes. Before you configure scheduling dependencies for the node based on the table lineage, you must check whether a strong lineage exists between the table data generated by the node and its ancestor node. To check whether a strong lineage exists, you can check whether the node can obtain valid data if the ancestor node fails to generate data. If the node fails to obtain valid data, a strong lineage exists.
No. | Description |
1 | To ensure that the current node can be run at its scheduling time, you can check whether a strong lineage exists between the table data that is generated by the current node and its ancestor node, and determine whether to configure scheduling dependencies for the current node based on the lineage between the table data. |
2 | Check whether the table data on which the current node depends is generated by an auto triggered node. DataWorks determines whether the table data of an auto triggered node is generated based on the status of the node. However, DataWorks cannot determine whether the table data of non-auto triggered nodes is generated based on the status of the nodes. If you use tables that are not generated by auto triggered nodes, you cannot specify the tables when you configure scheduling dependencies. Tables whose data is not generated based on periodic scheduling in DataWorks include but are not limited to the following tables:
|
3 and 4 | You can determine whether to configure the same-cycle or previous-cycle scheduling dependencies between the current node and its ancestor node based on the following conditions: Whether the current node needs to depend on the data that is generated by its ancestor node on the previous or current day, and whether the instance generated for the current node in the current cycle needs to depend on the instance generated for the node in the previous cycle if the current node is scheduled by hour or minute.
Note: For information about how to select a scheduling dependency type and make the related configurations when you configure scheduling dependencies for a node based on lineages, see Select a scheduling dependency type (same-cycle scheduling dependency). |
5, 6, and 7 | After the scheduling dependencies are configured and deployed to the production environment, you can check whether the scheduling dependencies meet your expectations on the Auto Triggered Nodes page in Operation Center. |
Configure custom scheduling dependencies for a node
You can configure scheduling dependencies for a node in the following scenarios based on your business requirements:
Scenario 1: No strong lineage exists between the node and its ancestor node. For example, the node does not strongly depend on the data in a specific partition in the output tables of its ancestor node but depends only on the data in the partition that has the largest partition key value.
Scenario 2: The node depends on table data that is not generated by an auto triggered node. For example, the node depends on the table data that is uploaded from your on-premises machine.
You can use the following methods to configure scheduling dependencies for a node based on your business requirements:
Specify the root node of a workspace as the ancestor node
For example, you can use the root node of a workspace as the ancestor node if a data synchronization node depends on data in other business databases or an SQL node processes the table data generated by a real-time synchronization node.
Specify a zero load node as the ancestor node
In a workspace, if a workflow contains a large number of nodes or has complex relationships among nodes, you can use a zero load node to manage the nodes in the workflow in a centralized manner. You can specify the zero load node as the ancestor node of the nodes that you want to manage. This way, data forwarding paths in the workspace are clearer. For example, you can use a zero load node to determine the scheduling time for nodes in a workflow, and schedule or freeze nodes in a workflow in a centralized manner.
Node Output Parameters
After you define an output parameter and its value for a node, you can define an input parameter for the descendant node of the node and configure the descendant node to reference the value of the output parameter in the input parameter.
Precautions
An output parameter defined for a node can be used only as an input parameter of the descendant node of the node. Output parameters of specific nodes cannot be used to directly pass the query results of the nodes to their descendant nodes. To configure an output parameter of a node as an input parameter of the descendant node of the node, you can add a scheduling parameter for the descendant node in the Scheduling Parameters section of the Properties tab of the descendant node, and click the icon in the Actions column to associate the output parameter with the scheduling parameter. If you want a node to use the query results of its ancestor node, you can use an assignment node to pass the query results. For more information,
You can configure output parameters for the following types of nodes: EMR Hive, EMR Spark SQL, ODPS Script, Hologres SQL, AnalyticDB for PostgreSQL, and MySQL.
Configure parameters in the Node Output Parameters section
You can define output parameters in the Node Output Parameters section. Values of the constant and variable types are supported for output parameters.
After you configure output parameters for a node and deploy the node, you can associate the output parameters with scheduling parameters configured for the descendant node of the node. This way, the output parameters are used as the input parameters of the descendant node.
Parameter Name: the name of an output parameter.
Parameter Value: the value of the output parameter. The value type can be constant or variable.
If the value type is constant, the value is a fixed string.
If the value type is variable, the value can be a global variable, built-in parameter, or custom parameter.
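The association between an output parameter and a descendant's scheduling parameter can be modeled roughly as follows. This is a conceptual sketch only: the `Node` class, `bind_input`, `resolve_inputs`, and the sample parameter names and values are hypothetical and do not correspond to a DataWorks API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Output parameter name -> value (a constant or an already resolved variable).
    outputs: dict[str, str] = field(default_factory=dict)
    # Scheduling parameter name -> (ancestor node, output parameter name).
    input_bindings: dict[str, tuple["Node", str]] = field(default_factory=dict)

    def bind_input(self, param: str, ancestor: "Node", output_name: str) -> None:
        """Associate a scheduling parameter of this node with an ancestor's output parameter."""
        if output_name not in ancestor.outputs:
            raise KeyError(f"{ancestor.name} has no output parameter {output_name!r}")
        self.input_bindings[param] = (ancestor, output_name)

    def resolve_inputs(self) -> dict[str, str]:
        """Look up the current value of each bound output parameter."""
        return {p: anc.outputs[out] for p, (anc, out) in self.input_bindings.items()}


upstream = Node("upstream", outputs={"table_version": "orders_v2"})
downstream = Node("downstream")
downstream.bind_input("src_version", upstream, "table_version")
print(downstream.resolve_inputs())
```

The binding stores a reference to the ancestor's parameter and not a copy of its value, so the descendant sees whatever value the ancestor's output parameter holds when the inputs are resolved.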
References
For more information about how to configure parameters in the Scheduling Parameters section, see Supported formats of scheduling parameters.
For more information about how to configure parameters in the Scheduling Policies section, see the following topics:
For more information about how to configure parameters in the Scheduling Time section, see Scheduling time.
For more information about how to configure parameters in the Scheduling Dependencies section, see the following topics:
For more information about how to configure parameters in the Node Output Parameters section, see the following topics:
Other references