Configure scheduling properties — schedule, dependencies, and parameters — for nodes and workflows to run them on a recurring schedule.
Prerequisites
-
A node has been created. Tasks for different engines use different node types. For more information, see Develop nodes.
-
The Enable Periodic Scheduling switch is turned on for the workspace. Enable it on the Scheduling Settings page. For more information, see System settings.
Notes
-
Scheduling configurations take effect only after the task is deployed to production.
-
The scheduling time defines the expected execution time. The actual execution time also depends on the status of ancestor nodes. For more information, see Diagnose task runs.
-
DataWorks supports dependencies between various task types. Before you configure complex dependencies, review Principles and examples for configuring scheduling in complex dependency scenarios.
-
A scheduled task generates scheduled instances based on its scheduling type and period. For example, an hourly task generates hourly instances each day, which run the task automatically.
-
When you use scheduling parameters, the scheduled time and your parameter expressions determine the values passed to the code. For more information, see Sources and expressions of scheduling parameters.
-
A workflow includes the workflow node and its internal nodes, creating complex dependencies. This topic covers scheduling configuration for individual nodes only. For workflow scheduling, see Orchestrate recurring workflows.
Go to the scheduling configuration page
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
-
Go to the scheduling configuration page.
-
In Data Studio, open the target node editor.
-
Click Schedule Settings in the right-side navigation pane of the node editor page.
Configure node scheduling properties
The scheduling configuration page includes the Scheduling Parameters, Scheduling Policy, Scheduling Time, Scheduling Dependencies, and Node Output Parameters sections.
(Optional) Scheduling parameters
If you defined variables in the node code, you must assign values to them here.
Scheduling parameters are automatically replaced with values based on the business date and the parameter expressions, enabling dynamic parameter replacement during task runs.
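The replacement step can be pictured with a minimal Python sketch. It assumes, without confirmation from this section, that variables appear in code as `${name}` placeholders and that the business date is the day before the scheduled time. The helper names `resolve_parameters` and `render_code` and the parameter names `bizdate` and `cyctime` are illustrative only and are not DataWorks APIs.

```python
from datetime import datetime, timedelta
from string import Template


def resolve_parameters(scheduled_time: datetime) -> dict:
    """Build illustrative parameter values from a scheduled time.

    Assumption: the business date is the day before the scheduled time.
    """
    business_date = scheduled_time - timedelta(days=1)
    return {
        "bizdate": business_date.strftime("%Y%m%d"),
        "cyctime": scheduled_time.strftime("%Y%m%d%H%M%S"),
    }


def render_code(code: str, scheduled_time: datetime) -> str:
    """Replace ${name} placeholders in task code with resolved values."""
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(code).safe_substitute(resolve_parameters(scheduled_time))


sql = "SELECT * FROM orders WHERE ds = '${bizdate}';"
rendered = render_code(sql, datetime(2024, 3, 15, 0, 30))
print(rendered)
```

Because the values are derived from the scheduled time rather than the wall clock, rerunning an old instance in this sketch yields the same rendered code as the original run.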
Configure scheduling parameters
You can define scheduling parameters in the following two ways.
| Method | Description | Example |
| --- | --- | --- |
| Add parameter | You can configure multiple scheduling parameters per task. Click Add Parameter to add more. | |
| Load parameters from code | This method automatically identifies variables defined in the task code and adds them as scheduling parameters. Note: Typically, variables are defined in code in the format PyODPS nodes and Shell nodes define variables differently from other nodes. For the scheduling parameter configuration formats for each node type, see Configure scheduling parameters for different node types. | |
Supported formats for scheduling parameters
Verify scheduling parameter configurations for production tasks
To avoid unexpected scheduling parameter values at runtime, verify the parameter configuration on the Scheduled Tasks page in Operation Center after deployment. For more information, see View scheduled tasks.
Scheduling policy
The scheduling policy defines how instances are generated, the scheduling type, compute resources, and resource groups for a task.
| Parameter | Description |
| --- | --- |
| Instance generation mode | After a node is submitted and deployed to production, the system generates Scheduled Instances based on the Instance Generation Mode. |
| Scheduling type | |
| Timeout | If a task exceeds the specified timeout, it is automatically terminated. |
| Rerun properties | Configure the rerun behavior of the node. The rerun properties cannot be empty. The supported types are: |
| Auto rerun upon failure | When enabled, if a task fails (excluding manual termination), the system automatically reruns the task based on the configured count and interval. |
| Compute resource | Configure the engine resource for the task. To create a resource, see Manage compute resources. |
| Compute quota | You can configure the compute quota required for the task to run in MaxCompute or EMR. The quota provides compute resources (CPU and memory) for compute jobs. |
| Resource group | Select the scheduling resource group for the task. |
| Maximum concurrent instances | Limits the maximum number of concurrent instances of the same task for concurrency control and resource protection. By default, no limit is set. After you enable the limit, you can set the number of concurrent instances. The default value is |
| Dataset | Click |
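The auto rerun behavior described in the table (a configured count and interval, with manual termination excluded) can be sketched as a simple retry loop. This is an illustration of the policy only, not how DataWorks implements it. `run_with_auto_rerun` and `ManuallyTerminated` are hypothetical names.

```python
import time
from typing import Callable


class ManuallyTerminated(Exception):
    """Hypothetical marker for a run stopped by a user; never rerun."""


def run_with_auto_rerun(
    task: Callable[[], str],
    rerun_count: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run the task once, then rerun up to rerun_count times after failures."""
    reruns_used = 0
    while True:
        try:
            return task()
        except ManuallyTerminated:
            raise  # manual termination is excluded from auto rerun
        except Exception:
            if reruns_used >= rerun_count:
                raise  # reruns exhausted: surface the last failure
            reruns_used += 1
            sleep(interval_seconds)  # wait the configured interval


calls = []


def flaky_task() -> str:
    """Fail on the first two runs, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "success"


# Pass a no-op sleep so the example finishes immediately.
result = run_with_auto_rerun(
    flaky_task, rerun_count=3, interval_seconds=60, sleep=lambda _: None
)
print(result, len(calls))
```

In this sketch the rerun count covers only the additional attempts, so a count of 3 allows up to four runs in total.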
Scheduling time
The scheduling time defines the period and time for automatic execution of a task.
For nodes within a workflow, the Scheduling Time parameters are configured in the Schedule Settings section on the workflow page. For standalone nodes that are not in a workflow, the Scheduling Time is configured in the Schedule Settings section of each node.
Notes
-
The scheduling frequency of a task is not related to its upstream tasks
How often a task is scheduled depends on its own scheduling period, not on the scheduling period of its upstream tasks.
-
DataWorks supports dependencies between tasks with different scheduling periods
In DataWorks, a scheduled task generates corresponding scheduled instances based on its scheduling type and period. For example, an hourly task generates a specific number of hourly instances each day. Tasks run through their instances. The dependencies configured for scheduled tasks are essentially dependencies between the instances generated by those tasks. When upstream and downstream tasks have different scheduling types, the number of generated instances and the instance dependency relationships differ. For more information about dependencies between tasks with different scheduling periods, see Cross-cycle dependency.
-
Tasks perform dry runs outside their scheduled time
In DataWorks, tasks that are not scheduled daily (such as weekly or monthly tasks) perform a dry run outside their scheduled time, immediately returning a success status when the scheduled time is reached. If a downstream daily task exists, the dry run triggers the downstream daily task to run normally. In other words, the upstream task performs a dry run, and the downstream scheduled task runs normally based on its own scheduled time.
-
Notes on task execution time
The configured time is the expected scheduling time. Actual execution depends on upstream task completion, available resources, and other conditions. For more information, see Task run conditions.
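To make the idea of instances concrete, the following sketch lists the scheduled times that an hourly task would receive on a single day. It depends only on the task's own start time, end time, and interval, which mirrors the point that a task's frequency is independent of its upstream tasks. The function `hourly_instance_times` and its parameters are hypothetical and only illustrate the counting.

```python
from datetime import date, datetime, time, timedelta


def hourly_instance_times(day: date, start: time, end: time, interval_hours: int) -> list:
    """Return the scheduled times an hourly task would get on one day."""
    current = datetime.combine(day, start)
    last = datetime.combine(day, end)
    times = []
    while current <= last:
        times.append(current)
        current += timedelta(hours=interval_hours)
    return times


instances = hourly_instance_times(date(2024, 3, 15), time(0, 0), time(23, 59), 6)
print([t.strftime("%H:%M") for t in instances])
```

With a 6-hour interval the sketch yields four instances for the day, and with a 1-hour interval it yields 24.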
Configure scheduling time
| Parameter | Description |
| --- | --- |
| Schedule | The schedule defines the period at which the task automatically runs in production. A task generates scheduled instances based on its schedule. For example, an hourly task generates hourly instances each day that run the task automatically. Important: For weekly, monthly, and yearly schedules, instances are still generated daily outside the scheduled time. These instances show a success status but perform a dry run and do not execute the task. |
| Effective date | The node is automatically scheduled within the specified effective date range. Expired tasks are no longer scheduled. View expired task counts on the O&M overview page and undeploy them as needed. |
| Cron expression | This expression is automatically generated based on the time property configuration. No manual configuration is required. |
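The dry-run behavior for non-daily schedules can be illustrated with a short sketch. It classifies the instance generated on each day of one week for a weekly task that is scheduled on Mondays. The function name and the labels `normal` and `dry_run` are hypothetical.

```python
from datetime import date, timedelta


def weekly_instance_kind(day: date, scheduled_weekdays: set) -> str:
    """Classify the instance generated for a day (Monday is weekday 0)."""
    return "normal" if day.weekday() in scheduled_weekdays else "dry_run"


week_start = date(2024, 1, 1)  # a Monday
kinds = {}
for offset in range(7):
    day = week_start + timedelta(days=offset)
    kinds[day.isoformat()] = weekly_instance_kind(day, {0})
print(kinds)
```

An instance is produced for every day, but only the Monday instance would execute the task. The other six return success immediately, so a downstream daily task is not blocked.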
Scheduling dependencies
Scheduling dependencies define upstream-downstream relationships between nodes. A downstream node starts only after its upstream nodes succeed. This ensures the downstream task reads correct, fully-produced data.
Notes
-
After you configure node dependencies, by default, one of the run conditions for a downstream node during scheduled execution is that all its upstream nodes have completed successfully. Otherwise, data quality issues may occur in the current task.
-
The actual execution time depends on both the task's own scheduled time and the completion time of its upstream tasks. If an upstream task has not completed, the downstream task waits even if its scheduled time has arrived. For more information, see Diagnose task runs.
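The waiting rule in these notes amounts to taking the latest of the task's own scheduled time and the finish times of its upstream instances. A minimal sketch, which ignores resource availability and other run conditions and uses the hypothetical function `earliest_start`:

```python
from datetime import datetime


def earliest_start(scheduled_time: datetime, upstream_finish_times: list) -> datetime:
    """Return the earliest moment a downstream instance can start.

    It waits for its own scheduled time and for every upstream finish.
    """
    return max([scheduled_time, *upstream_finish_times])


scheduled = datetime(2024, 3, 15, 2, 0)
upstream_finishes = [datetime(2024, 3, 15, 1, 30), datetime(2024, 3, 15, 3, 10)]
start = earliest_start(scheduled, upstream_finishes)
print(start)
```

Here the task is scheduled for 02:00 but cannot start before 03:10, when its slowest upstream instance finishes.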
Configure scheduling dependencies
The goal of configuring task dependencies is to ensure downstream tasks read correct data. Dependencies are essentially lineage dependencies between upstream and downstream tables. Configure dependencies based on table lineage as needed.
Node dependencies create a strong dependency by default — the downstream table depends on the upstream table's data production. Determine whether a strong lineage dependency exists before configuring.
| Step | Description |
| --- | --- |
| ① | To avoid unexpected execution times for the current task, first assess whether a strong dependency exists between the tables to determine whether scheduling dependencies need to be configured based on lineage. |
| ② | Determine whether the upstream data is produced by a DataWorks scheduled task. DataWorks cannot monitor data production for tables not produced by its scheduled tasks, so some tables do not support scheduling dependency configuration. Tables that are not produced by DataWorks periodic scheduling include but are not limited to the following types: |
| ③④ | Based on whether you need to depend on upstream data from yesterday or today, and whether hourly or minute tasks need to depend on the previous hourly or minute instance, choose between same-cycle or previous-cycle dependency on the upstream. |
| ⑤⑥⑦ | After the dependency configuration is complete and deployed to production, you can verify whether the task dependencies meet your expectations by viewing the Scheduled Tasks page in Operation Center. |
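The choice between same-cycle and previous-cycle dependency in steps ③④ can be pictured as selecting which upstream instance a downstream instance waits for. The sketch below is a simplified model that assumes the upstream and downstream tasks share the same period and aligned cycles. The function `upstream_instance_time` and the mode names are hypothetical.

```python
from datetime import datetime, timedelta


def upstream_instance_time(downstream_cycle: datetime, mode: str, period: timedelta) -> datetime:
    """Return the cycle of the upstream instance the downstream waits for."""
    if mode == "same_cycle":
        return downstream_cycle  # today's data, or the same hour
    if mode == "previous_cycle":
        return downstream_cycle - period  # yesterday's data, or the previous hour
    raise ValueError(f"unknown dependency mode: {mode}")


daily = upstream_instance_time(datetime(2024, 3, 15), "previous_cycle", timedelta(days=1))
hourly = upstream_instance_time(datetime(2024, 3, 15, 10), "previous_cycle", timedelta(hours=1))
print(daily, hourly)
```

For a daily task the previous-cycle dependency points at the instance of the day before, and for an hourly task it points at the instance of the previous hour.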
Customize node dependencies
If there is no strong lineage dependency between tasks (for example, the task only reads the latest partition data), or the dependent data is not produced by a scheduled node (for example, locally uploaded data), customize the node dependencies:
-
Depend on the workspace root node
For example, when the upstream data in a synchronization task comes from other business databases, or an SQL-type task processes table data produced by a real-time synchronization task, you can directly mount the dependency to the workspace root node.
-
Depend on a virtual node
Use virtual nodes to manage complex business processes. Mount related nodes as dependencies of a virtual node to control overall scheduling time or freeze a business process.
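The lineage-based reasoning above can be sketched as follows: a node depends on whichever scheduled node writes a table it reads, and a node whose inputs have no scheduled producer falls back to the workspace root node. The data layout, the function `derive_upstreams`, and the name `workspace_root` are assumptions made for illustration.

```python
ROOT = "workspace_root"  # hypothetical name for the workspace root node


def derive_upstreams(nodes: dict) -> dict:
    """Map each node to its upstream nodes based on table lineage.

    nodes: {node_name: {"reads": [tables], "writes": [tables]}}
    """
    # Index every table by the scheduled node that produces it.
    producer = {
        table: name for name, spec in nodes.items() for table in spec["writes"]
    }
    upstreams = {}
    for name, spec in nodes.items():
        deps = {
            producer[table]
            for table in spec["reads"]
            if table in producer and producer[table] != name
        }
        # No scheduled producer for any input: depend on the root node.
        upstreams[name] = sorted(deps) or [ROOT]
    return upstreams


nodes = {
    "sync_orders": {"reads": ["mysql.orders"], "writes": ["ods_orders"]},
    "dwd_orders": {"reads": ["ods_orders"], "writes": ["dwd_orders"]},
}
result = derive_upstreams(nodes)
print(result)
```

In this example `sync_orders` reads from a business database that no scheduled node produces, so it is attached to the root node, while `dwd_orders` depends on `sync_orders`.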
Node output parameters
Define output parameters in an upstream node and reference them as input parameters in downstream nodes to pass values between tasks.
Notes
-
Node Output Parameters can only be used as input parameters for downstream nodes (add a parameter in the scheduling parameter section of the downstream node, and click
in the Actions column to associate it with an upstream parameter). Some nodes cannot directly pass query results from upstream to downstream. If you need to pass the query results of an upstream node to a downstream node, use an assignment node. For more information, see Use assignment nodes.
-
The following nodes support output parameters:
EMR Hive, EMR Spark SQL, ODPS Script, Hologres SQL, AnalyticDB for PostgreSQL, and MySQL nodes.
Configure node output parameters
The values of Node Output Parameters can be of two types: Constant and Variable.
After you define the output parameters and submit the current node, you can Associate Upstream Node Output Parameters as input parameters when configuring scheduling parameters for the downstream node.

-
Parameter name: The name of the defined output parameter.
-
Parameter value: The value of the output parameter. The value types include constant and variable:
-
A constant is a fixed string.
-
A variable can be a system-supported global variable, a built-in scheduling parameter, or a custom parameter.
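The relationship between output parameters and downstream input parameters can be sketched as follows. A constant passes its value through unchanged, while a variable is looked up when the node runs. The class `OutputParameter`, the functions, and the sample names are hypothetical and do not reflect a DataWorks API.

```python
from dataclasses import dataclass


@dataclass
class OutputParameter:
    name: str
    value: str
    kind: str  # "constant" or "variable"


def resolve_output(param: OutputParameter, variables: dict) -> str:
    """Return a fixed string for constants, or look up a variable's value."""
    if param.kind == "constant":
        return param.value
    return variables[param.value]


def bind_inputs(bindings: dict, upstream_outputs: list, variables: dict) -> dict:
    """Resolve downstream inputs from associated upstream outputs.

    bindings: {downstream_input_name: upstream_output_name}
    """
    by_name = {param.name: param for param in upstream_outputs}
    return {
        input_name: resolve_output(by_name[output_name], variables)
        for input_name, output_name in bindings.items()
    }


outputs = [
    OutputParameter("region", "cn-hangzhou", "constant"),
    OutputParameter("day", "bizdate", "variable"),
]
inputs = bind_inputs(
    {"in_region": "region", "in_day": "day"},
    outputs,
    variables={"bizdate": "20240314"},
)
print(inputs)
```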
Configure the associated role for a node
The associated role feature lets you specify a RAM role for a task node. At runtime, the task dynamically obtains temporary STS credentials for the role, allowing your code to access other cloud resources without permanent AccessKey pairs.
-
Resource group restriction: Only nodes running on serverless resource groups are supported.
-
Node type restriction: Only Python, Shell, Notebook, PyODPS 2, and PyODPS 3 nodes are supported.
Step 1: Configure the associated role for a DataWorks node
-
On the right side of the node editor page, find and click Run Configuration.
-
In the scheduling configuration panel, switch to the Associated Role tab.
-
In the RAM Role drop-down list, select the RAM role you have prepared.
Important
If the drop-down list is empty or you cannot find the desired role, see Configure a RAM role to complete the RAM role configuration.
-
After the configuration is complete, submit the node. This configuration takes effect only for debug runs.
Step 2: Obtain and use temporary credentials in code
After you configure the associated role, DataWorks injects the obtained temporary credentials into the runtime environment when the task runs. You can obtain them in your code in the following two ways.
Method 1: Read environment variables (recommended for Shell and Python)
The system automatically sets the following three environment variables, which you can read directly in your code.
-
LINKED_ROLE_ACCESS_KEY_ID: The temporary AccessKey ID.
-
LINKED_ROLE_ACCESS_KEY_SECRET: The temporary AccessKey secret.
-
LINKED_ROLE_SECURITY_TOKEN: The temporary security token.
Code example (Python):
This example requires a custom Python image with oss2 installed. For more information, see Use custom images.
import os
import oss2

# 1. Get temporary credentials from environment variables
access_key_id = os.environ.get('LINKED_ROLE_ACCESS_KEY_ID')
access_key_secret = os.environ.get('LINKED_ROLE_ACCESS_KEY_SECRET')
security_token = os.environ.get('LINKED_ROLE_SECURITY_TOKEN')

# Check whether the credentials are obtained
if not all([access_key_id, access_key_secret, security_token]):
    raise Exception("Failed to get linked role credentials from environment variables.")

# 2. Use temporary credentials to initialize the OSS client
# Assume that you have granted the role access to 'your-bucket-name'
auth = oss2.StsAuth(access_key_id, access_key_secret, security_token)
bucket = oss2.Bucket(auth, 'http://oss-<regionID>-internal.aliyuncs.com', 'your-bucket-name')

# 3. Use the client to access OSS resources
try:
    # List objects in the bucket
    for obj in oss2.ObjectIterator(bucket):
        print('object name: ' + obj.key)
    print("Successfully accessed OSS with linked role.")
except oss2.exceptions.OssError as e:
    print(f"Error accessing OSS: {e}")
Code example (Shell):
#!/bin/bash
access_key_id=${LINKED_ROLE_ACCESS_KEY_ID}
access_key_secret=${LINKED_ROLE_ACCESS_KEY_SECRET}
security_token=${LINKED_ROLE_SECURITY_TOKEN}
# Access OSS. Replace regionID, bucket_name, and file_name with actual values.
echo "ID:"$access_key_id
echo "token:"$security_token
ls -al /home/admin/usertools/tools/
# This example uses ossutil to download a file from a specified OSS path to the local file test_dw.py, and then prints the file content.
/home/admin/usertools/tools/ossutil64 cp --access-key-id $access_key_id --access-key-secret $access_key_secret --sts-token $security_token --endpoint http://oss-<regionID>-internal.aliyuncs.com oss://<bucket_name>/<file_name> test_dw.py
echo "************************ Retrieved successfully ************************, printing result"
cat test_dw.py
Method 2: Use the Credentials Client (recommended for Python)
Code example (Python):
This example requires a custom Python image with oss2 and alibabacloud_credentials installed. For more information, see Use custom images.
from alibabacloud_credentials.client import Client as CredentialClient
import oss2

# 1. Use the SDK to automatically obtain credentials
# It automatically looks for LINKED_ROLE_* credential information in environment variables
cred_client = CredentialClient()
credential = cred_client.get_credential()
access_key_id = credential.get_access_key_id()
access_key_secret = credential.get_access_key_secret()
security_token = credential.get_security_token()
if not all([access_key_id, access_key_secret, security_token]):
    raise Exception("Failed to get linked role credentials via SDK.")

# 2. Use the credentials to initialize the OSS client
auth = oss2.StsAuth(access_key_id, access_key_secret, security_token)
bucket = oss2.Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com', 'your-bucket-name')

# 3. Access OSS
print("Listing objects in bucket...")
for obj in oss2.ObjectIterator(bucket):
    print(' - ' + obj.key)
print("Successfully accessed OSS with linked role via SDK.")
Step 3: Run and verify
-
Shell, Python: At runtime, the task uses the specified RAM role to access other Alibaba Cloud services.
-
PyODPS: When accessing other Alibaba Cloud services (such as OSS), the task uses the RAM role you configured. However, when accessing MaxCompute data, the task still uses the access identity configured for the compute resource (at the project level).
Configure scheduling properties
After debugging the node, synchronize the Associated Role from Run Configuration to Schedule Settings. After deployment, the task runs with the identity of the specified role.
If you configured a custom image in Run Configuration, you must also synchronize the settings to the scheduling configuration.
View the execution role in Operation Center
After the task finishes running, view the details of the task instance in Operation Center to confirm whether the specified role was successfully used.
-
Go to .
-
Find the node instance you ran and click it to open the details page.
-
In the Properties section of the instance details, check the Execution Identity field. This field displays the ARN of the associated role actually used for the run.
An ARN is a unique resource identifier. For more information, see ARN.
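When checking the Execution Identity value, it can help to split the ARN into its parts. The sketch below assumes the common RAM role layout `acs:ram::<account-id>:role/<role-name>`, which this topic does not state. The function `parse_role_arn` and the sample values are hypothetical.

```python
import re

# Assumed layout: acs:ram::<account-id>:role/<role-name>
ROLE_ARN = re.compile(r"^acs:ram::(?P<account_id>\d+):role/(?P<role_name>[\w.\-]+)$")


def parse_role_arn(arn: str) -> dict:
    """Split a RAM role ARN into its account ID and role name."""
    match = ROLE_ARN.match(arn)
    if match is None:
        raise ValueError(f"not a RAM role ARN: {arn}")
    return match.groupdict()


parts = parse_role_arn("acs:ram::1234567890123456:role/dataworks-oss-reader")
print(parts)
```

Comparing the parsed role name with the role selected on the Associated Role tab confirms that the instance ran with the intended identity.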
References
-
Scheduling parameter reference: Sources and expressions of scheduling parameters.
-
Scheduling policy references:
-
Scheduling time reference: Scheduling time reference.
-
Scheduling dependency references:
-
Node output parameter reference: Node output parameters.
-
Other references: Other scheduling references.

