Configure task scheduling properties in Data Studio - DataWorks

Prerequisites

A node has been created. Different engine tasks use different node types. For more information, see Develop nodes.
The Enable Periodic Scheduling switch is turned on for the workspace. You can turn it on on the Scheduling Settings page. For more information, see System settings.

Notes

Scheduling configurations take effect only after the task is deployed to production.
The scheduling time defines the expected execution time. The actual execution time also depends on the status of ancestor nodes. For more information, see Diagnose task runs.
DataWorks supports dependencies between various task types. Review Principles and examples for configuring scheduling in complex dependency scenarios before configuring complex dependencies.
A scheduled task generates scheduled instances based on its scheduling type and period. For example, an hourly task generates hourly instances each day, which run the task automatically.
When you use scheduling parameters, the scheduled time and your parameter expressions determine the values passed to the code. For more information, see Sources and expressions of scheduling parameters.
A workflow includes the workflow node and its internal nodes, which creates complex dependencies. This topic covers scheduling configuration for individual nodes only. For workflows, see Orchestrate recurring workflows.

Go to the scheduling configuration page

Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose Shortcuts > Data Studio in the Actions column.
Go to the scheduling configuration page.
1. In Data Studio, open the target node editor.
2. Click Schedule Settings in the right-side navigation pane of the node editor page.

Configure node scheduling properties

The scheduling configuration page includes the Scheduling Parameters, Scheduling Policy, Scheduling Time, Scheduling Dependencies, and Node Output Parameters sections.

(Optional) Scheduling parameters

If you defined variables in the node code, you must assign values to them here.

Scheduling parameters are automatically replaced with values based on the business date and the parameter expressions, enabling dynamic parameter replacement during task runs.

Configure scheduling parameters

You can define scheduling parameters in the following two ways.

Method	Description
Add parameter	You can configure multiple scheduling parameters per task. Click Add Parameter to add more. Manually assign values to the scheduling parameters. For more information, see Sources and expressions of scheduling parameters. You can also click the icon in the Actions column of a parameter to associate the parameter defined in the current node with an output parameter of an upstream node.
Load parameters from code	This method automatically identifies variables defined in the task code and adds them as scheduling parameters. Note Typically, variables are defined in code in the format ${custom_variable_name}. PyODPS nodes and Shell nodes define variables differently from other nodes. For the scheduling parameter configuration formats for each node type, see Configure scheduling parameters for different node types.
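
The replacement of ${custom_variable_name} placeholders can be pictured with a minimal Python sketch. It is an illustration only, not how DataWorks performs the replacement: the function names, the sample SQL, and the parameter name dt are hypothetical, and the sketch assumes that the business date is the day before the scheduled day.

```python
import re
from datetime import date, timedelta

# Matches placeholders written as ${name}, the variable format described above.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def business_date(scheduled_day: date) -> date:
    """Assumption: the business date is the day before the scheduled day."""
    return scheduled_day - timedelta(days=1)


def render(code: str, params: dict[str, str]) -> str:
    """Replace every ${name} in `code` with its value from `params`.

    Raises KeyError when a variable has no assigned value, because every
    variable defined in the node code must be assigned a value.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"no value assigned to scheduling parameter '{name}'")
        return params[name]

    return PLACEHOLDER.sub(substitute, code)


# Hypothetical node code and parameter assignment.
sql = "SELECT * FROM sales WHERE ds = '${dt}';"
params = {"dt": business_date(date(2024, 5, 2)).strftime("%Y%m%d")}
print(render(sql, params))
```

With a scheduled day of 2024-05-02, the sketch renders the partition filter as ds = '20240501'. Verify the actual values for a production task in Operation Center rather than relying on a local approximation like this one.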

Supported formats for scheduling parameters

For more information, see Sources and expressions of scheduling parameters.

Verify scheduling parameter configurations for production tasks

To avoid unexpected scheduling parameter values at runtime, verify the parameter configuration on the Scheduled Tasks page in Operation Center after deployment. For more information, see View scheduled tasks.

Scheduling policy

The scheduling policy defines how instances are generated, the scheduling type, compute resources, and resource groups for a task.

Parameter	Description
Instance generation mode	After a node is submitted and deployed to production, the system generates Scheduled Instances based on the Instance Generation Mode. T+1 Next Day: After the node is deployed to the production environment, automatic scheduling starts from the next day. You can view the execution status of the task on the scheduled instances page. If you need to run the task on the same day, you can backfill data for the task. The execution results of backfill instances with a business date of `yesterday` and `today` are the same. Immediately After Deployment: After the node is deployed to the production environment, automatic scheduling starts on the same day. You can view the execution status of the task on the scheduled instances page. For a newly created task, whether the task actually processes data or performs a dry run on the same day depends on the scheduled time and the deployment time. When you modify the schedule of a deployed production task, DataWorks replaces the generated instances for future time periods based on the latest scheduling configuration, but expired instances are not deleted.
Scheduling type	Normal Scheduling (default): Use case: The scheduled task runs normally, and the generated scheduled instances also run normally. Impact: The task is triggered at the scheduled time and runs normally (data is actually processed). After the current node runs successfully, it also triggers the normal scheduling of downstream nodes. Pause Scheduling: Use case: The scheduled task is in frozen status, and the generated scheduled instances are also frozen. The current node cannot run and blocks downstream nodes from running. To temporarily stop a business process, freeze the root node, and unfreeze it when the business needs to resume. For more information, see Unfreeze tasks. Impact: The task is triggered at the scheduled time, but the node status is set to paused (data is not actually processed). When the scheduler reaches this task, the system directly returns a failure, and the downstream nodes that depend on the current node are blocked from running. Dry Run: Use case: Select this scheduling type when a node does not need to run for a period of time and you do not want it to block downstream nodes. Impact: The task is triggered at the scheduled time, but the node performs a dry run (data is not actually processed). When the scheduler reaches this task, the system directly returns a success (with a duration of `0` seconds). The task is not executed (the runtime log is empty), does not block downstream nodes (downstream nodes run normally), and does not consume resources.
Timeout	If a task exceeds the specified timeout, it is automatically terminated. The timeout applies to scheduled instances, backfill instances, and test instances. The default timeout is 3 to 7 days. The system dynamically adjusts the default timeout based on the actual load. When you manually set a timeout, the maximum value is 168 hours (7 days).
Rerun properties	Configure the rerun behavior of the node. The rerun properties cannot be empty. The supported types are: Allow Rerun After Success or Failure: Select this rerun type if rerunning the node multiple times does not affect the results. Deny Rerun After Success, Allow Rerun After Failure: Select this rerun type if rerunning the node after a successful run affects the results, but rerunning after a failure does not. Deny Rerun After Success or Failure: Select this rerun type if rerunning the node after either a success or failure affects the results (for example, certain data synchronization nodes). Note When this type is selected, the system does not automatically rerun the node even after a system failure is recovered. The Auto Rerun upon Failure option cannot be configured.
Auto rerun upon failure	When enabled, if a task fails (excluding manual termination), the system automatically reruns it based on the configured count and interval. Rerun Count: The number of automatic reruns when a scheduled task fails. The minimum rerun count is 1 (the task is automatically rerun once after a failure), and the maximum is 10 (the task is automatically rerun up to 10 times after a failure). You can modify this value based on your business needs. Rerun Interval: The default interval between reruns is 30 minutes. The minimum supported value is 1 minute, and the maximum is 30 minutes. Note You can set the workspace-level default rerun count and interval on the Scheduling Settings page. For more information, see System settings. If a node fails because the timeout is exceeded, the auto rerun configuration does not take effect.
Compute resource	Configure the engine resource for the task. To create a resource, see Manage compute resources.
Compute quota	You can configure the compute quota required for the task to run in MaxCompute or EMR. The quota provides compute resources (CPU and memory) for compute jobs.
Resource group	Select the scheduling resource group for the task. To change the default resource group for new tasks, go to the Scheduling Settings page. For more information, see System settings. To change the resource group of an existing task, see Change the resource group of a task.
Maximum concurrent instances	Limits the maximum number of concurrent instances of the same task for concurrency control and resource protection. By default, no limit is set. After you enable the limit, you can set the number of concurrent instances. The default value is `1`, and the valid range is `1 to 10000`. The concurrent instance limit applies to the following scopes: Scheduled workflows: Scheduled instances, backfill instances, and test instances. Trigger-based workflows: Triggered instances. Note Trigger-based workflows support setting the maximum number of concurrent instances for internal tasks within the workflow, which limits the concurrency of all internal node instances generated by the workflow. When this is configured together with the maximum concurrent instances for a single node, both limits must be satisfied. After you enable the maximum concurrent instances limit, instances generated before the limit was enabled are not affected. Only instances generated afterward are subject to the limit. When multiple types of instances are queued simultaneously, only backfill instances that are not for the current day have their priority lowered.
Dataset	Click to add a created dataset to the node. Only PyODPS, Python, and Shell nodes support adding datasets during development. Dataset: Select a dataset created in the current workspace from the drop-down list. When you select a dataset of the Object Storage Service (OSS) type, you must grant the resource group access to the bucket for the first time. Each bucket needs to be authorized only once. When you select a dataset of the File Storage NAS type and the DataWorks resource group network is not connected to the NAS mount target, you must adjust the VPC network to ensure connectivity between the resource group and the NAS mount target. Note When the VPC associated with the DataWorks resource group is the same as the VPC associated with the NAS mount target, the network connectivity is established. Mount path: The default mount path configured for the dataset is automatically loaded. You can manually modify it. Advanced configuration: When reading OSS or NAS datasets during node development, you can adjust the read method and mount protocol by configuring different datasets. Read-only: When enabled, the data development task can only read data during runtime and cannot write data to OSS or NAS.
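
The numeric limits stated in the table above can be collected into a small validation sketch. This is an illustrative Python model, not a DataWorks API: the class, the field names, and the rerun type identifiers are hypothetical, and only the ranges themselves (rerun count 1 to 10, rerun interval 1 to 30 minutes, manual timeout up to 168 hours, maximum concurrent instances 1 to 10000) come from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical identifiers for the three rerun types described in the table.
RERUN_ALWAYS = "allow_after_success_or_failure"
RERUN_ON_FAILURE = "allow_after_failure_only"
RERUN_NEVER = "deny_after_success_or_failure"


@dataclass
class SchedulingPolicy:
    rerun_type: str = RERUN_ALWAYS
    auto_rerun: bool = False
    rerun_count: int = 1                  # valid range: 1 to 10
    rerun_interval_minutes: int = 30      # valid range: 1 to 30
    timeout_hours: Optional[int] = None   # None: use the system default
    max_concurrent: Optional[int] = None  # None: no limit


def validate(policy: SchedulingPolicy) -> list[str]:
    """Return a list of violated constraints (empty when the policy is valid)."""
    errors = []
    if policy.rerun_type not in (RERUN_ALWAYS, RERUN_ON_FAILURE, RERUN_NEVER):
        errors.append("rerun properties cannot be empty or unknown")
    if policy.auto_rerun:
        if policy.rerun_type == RERUN_NEVER:
            errors.append("auto rerun cannot be configured when reruns are denied")
        if not 1 <= policy.rerun_count <= 10:
            errors.append("rerun count must be between 1 and 10")
        if not 1 <= policy.rerun_interval_minutes <= 30:
            errors.append("rerun interval must be between 1 and 30 minutes")
    if policy.timeout_hours is not None and not 0 < policy.timeout_hours <= 168:
        errors.append("manual timeout must be greater than 0 and at most 168 hours")
    if policy.max_concurrent is not None and not 1 <= policy.max_concurrent <= 10000:
        errors.append("maximum concurrent instances must be between 1 and 10000")
    return errors


# Example: a rerun count of 12 and a timeout of 200 hours are both out of range.
print(validate(SchedulingPolicy(auto_rerun=True, rerun_count=12, timeout_hours=200)))
```

A check like this can be useful when scheduling settings are kept in version control and reviewed before they are entered in the console.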

Scheduling time

The scheduling time defines the period and time for automatic execution of a task.

Note

For nodes within a workflow, the Scheduling Time parameters are configured in the Schedule Settings section on the workflow page. For standalone nodes that are not in a workflow, the Scheduling Time is configured in the Schedule Settings section of each node.

Notes

The scheduling frequency of a task is not related to its upstream tasks

How often a task is scheduled depends on its own scheduling period, not on the scheduling period of its upstream tasks.
DataWorks supports dependencies between tasks with different scheduling periods

In DataWorks, a scheduled task generates corresponding scheduled instances based on its scheduling type and period. For example, an hourly task generates a specific number of hourly instances each day. Tasks run through their instances. The dependencies configured for scheduled tasks are essentially dependencies between the instances generated by those tasks. When upstream and downstream tasks have different scheduling types, the number of generated instances and the instance dependency relationships differ. For more information about dependencies between tasks with different scheduling periods, see Cross-cycle dependency.
Tasks perform dry runs outside their scheduled time

In DataWorks, tasks that are not scheduled daily (such as weekly or monthly tasks) perform a dry run outside their scheduled time, immediately returning a success status when the scheduled time is reached. If a downstream daily task exists, the dry run triggers the downstream daily task to run normally. In other words, the upstream task performs a dry run, and the downstream scheduled task runs normally based on its own scheduled time.
Notes on task execution time

The configured time is the expected scheduling time. Actual execution depends on upstream task completion, available resources, and other conditions. For more information, see Task run conditions.
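
The dry-run behavior of tasks that are not scheduled daily can be sketched as follows. This is a simplified Python model under stated assumptions, not DataWorks code: the function name and the "run" and "dry-run" labels are hypothetical, and weekdays follow Python's date.weekday() convention, where Monday is 0.

```python
from datetime import date, timedelta


def weekly_instances(start: date, days: int, run_weekdays: set[int]) -> list[tuple[date, str]]:
    """Model one instance per day for a weekly task.

    The instance really runs on the scheduled weekdays and performs a
    dry run (immediate success, no data processed) on all other days.
    """
    result = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        mode = "run" if day.weekday() in run_weekdays else "dry-run"
        result.append((day, mode))
    return result


# A task scheduled for Mondays only, observed over one week starting Monday 2024-05-06.
for day, mode in weekly_instances(date(2024, 5, 6), 7, {0}):
    print(day.isoformat(), mode)
```

In this model, a downstream daily task is never blocked: on the six dry-run days the upstream instance succeeds immediately, so the downstream task runs at its own scheduled time.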

Configure scheduling time

Parameter	Description
Schedule	The schedule defines the period at which the task automatically runs in production. A task generates scheduled instances based on its schedule. For example, an hourly task generates hourly instances each day that run the task automatically. Minute schedule: During a specified time range each day, the scheduled task runs at intervals of `N * specified minutes`. The minimum granularity of the Time Interval for minute schedules is 1 minute. Hourly schedule: During a specified time range each day, the scheduled task runs at intervals of `N * 1 hour`. Daily schedule: The node runs once each day at the specified time. When you create a scheduled task, the default daily schedule runs once at 00:00 each day. You can specify a different run time as needed. Weekly schedule: The scheduled task automatically runs once at a specific time on specific days of each week. Monthly schedule: The scheduled task automatically runs once at a specific time on specific days of each month. Yearly schedule: The scheduled task automatically runs once at a specific time on specific days of each year. Important For weekly, monthly, and yearly schedules, instances are still generated every day. On days outside the scheduled time, the instance shows a success status but performs a dry run and does not execute the task.
Effective date	The node is automatically scheduled within the specified effective date range. Expired tasks are no longer scheduled. View expired task counts on the O&M overview page and undeploy them as needed.
Cron expression	This expression is automatically generated based on the time property configuration. No manual configuration is required.
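
For minute and hourly schedules, the instances generated in a day follow from the time range and the interval. The Python sketch below enumerates them. It is an approximation for reasoning about instance counts, not the DataWorks scheduler: the function name is hypothetical, and treating the end of the time range as inclusive is an assumption.

```python
from datetime import date, datetime, time, timedelta


def scheduled_times(day: date, start: time, end: time, interval: timedelta) -> list[datetime]:
    """Return the run times on `day` from `start` to `end`, spaced by `interval`.

    Assumption: the end of the time range is inclusive.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    current = datetime.combine(day, start)
    last = datetime.combine(day, end)
    times = []
    while current <= last:
        times.append(current)
        current += interval
    return times


day = date(2024, 5, 2)

# Hourly schedule: every 6 hours between 00:00 and 23:59.
hourly = scheduled_times(day, time(0, 0), time(23, 59), timedelta(hours=6))
print([t.strftime("%H:%M") for t in hourly])

# Minute schedule: every 30 minutes between 09:00 and 10:00.
minutely = scheduled_times(day, time(9, 0), time(10, 0), timedelta(minutes=30))
print([t.strftime("%H:%M") for t in minutely])
```

Under these assumptions, the hourly example yields four instances (00:00, 06:00, 12:00, 18:00) and the minute example yields three (09:00, 09:30, 10:00).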

Scheduling dependencies

Scheduling dependencies define upstream-downstream relationships between nodes. A downstream node starts only after its upstream nodes succeed. This ensures the downstream task reads correct, fully-produced data.

Notes

After you configure node dependencies, by default, one of the run conditions for a downstream node during scheduled execution is that all its upstream nodes have completed successfully. Otherwise, data quality issues may occur in the current task.
The actual execution time depends on both the task's own scheduled time and the completion time of its upstream tasks. If an upstream task has not completed, the downstream task waits even if its scheduled time has arrived. For more information, see Diagnose task runs.
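
The two notes above amount to a simple rule: a downstream instance can start no earlier than its own scheduled time and no earlier than the moment its last upstream instance succeeds. A minimal Python sketch of that rule follows. The function name is hypothetical, and the sketch ignores resource availability and the other run conditions.

```python
from datetime import datetime
from typing import Optional


def earliest_start(scheduled: datetime,
                   upstream_finish: list[Optional[datetime]]) -> Optional[datetime]:
    """Return the earliest time the downstream instance can start.

    Each entry in `upstream_finish` is the time an upstream instance
    succeeded, or None if it has not succeeded yet. While any upstream
    is pending, the downstream instance keeps waiting (None is returned).
    """
    if any(finished is None for finished in upstream_finish):
        return None
    return max([scheduled, *upstream_finish])


scheduled = datetime(2024, 5, 2, 1, 0)

# The upstream finished late, so the downstream starts after its scheduled time.
print(earliest_start(scheduled, [datetime(2024, 5, 2, 0, 30), datetime(2024, 5, 2, 1, 45)]))

# One upstream has not succeeded yet, so the downstream keeps waiting.
print(earliest_start(scheduled, [datetime(2024, 5, 2, 0, 30), None]))
```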

Configure scheduling dependencies

The goal of configuring task dependencies is to ensure downstream tasks read correct data. Dependencies are essentially lineage dependencies between upstream and downstream tables. Configure dependencies based on table lineage as needed.

Node dependencies create a strong dependency by default — the downstream table depends on the upstream table's data production. Determine whether a strong lineage dependency exists before configuring.

Step	Description
①	To avoid unexpected execution times for the current task, first assess whether a strong dependency exists between the tables to determine whether scheduling dependencies need to be configured based on lineage.
②	Determine whether the upstream data is produced by a DataWorks scheduled task. DataWorks cannot monitor data production for tables not produced by its scheduled tasks, so some tables do not support scheduling dependency configuration. Tables that are not produced by DataWorks periodic scheduling include but are not limited to the following types: Tables produced by real-time synchronization Tables uploaded to DataWorks from a local source Dimension tables Tables produced by manual tasks Tables with periodic changes that are not produced by DataWorks scheduling nodes
③④	Based on whether you need to depend on upstream data from yesterday or today, and whether hourly or minute tasks need to depend on the previous hourly or minute instance, choose between same-cycle or previous-cycle dependency on the upstream. Same-cycle dependency: The downstream task depends on the table data produced by the upstream task on the current day. Previous-cycle dependency (cross-cycle dependency): The downstream task depends on the table data produced by the upstream task on the previous day. Special dependency scenarios for hourly and minute tasks: If a task needs to depend on the data from its own previous hourly or minute scheduled instance, you can set up a cross-cycle dependency. When an hourly task depends on another hourly task and the upstream and downstream scheduled times are exactly the same, setting a cross-cycle dependency allows the downstream 2:00 instance to depend on the upstream 1:00 instance. The same principle applies to minute tasks. Note For more information, see Same-cycle dependency.
⑤⑥⑦	After the dependency configuration is complete and deployed to production, you can verify whether the task dependencies meet your expectations by viewing the Scheduled Tasks page in Operation Center.
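
The difference between same-cycle and previous-cycle dependencies can be sketched for the hourly case described in the table. The Python function below is a hypothetical illustration, not DataWorks logic, and it assumes that the upstream and downstream tasks have identical scheduled times.

```python
from datetime import datetime, timedelta


def upstream_instance(downstream: datetime, cycle: timedelta, previous_cycle: bool) -> datetime:
    """Return the scheduled time of the upstream instance that the downstream instance waits for.

    Same-cycle: the upstream instance of the same cycle.
    Previous-cycle (cross-cycle): the upstream instance one cycle earlier.
    """
    return downstream - cycle if previous_cycle else downstream


downstream = datetime(2024, 5, 2, 2, 0)  # the downstream 02:00 instance
hour = timedelta(hours=1)

print(upstream_instance(downstream, hour, previous_cycle=False))  # upstream 02:00 instance
print(upstream_instance(downstream, hour, previous_cycle=True))   # upstream 01:00 instance
```

For daily tasks the same function applies with a cycle of one day: a previous-cycle dependency points at the upstream instance of the previous day.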

Customize node dependencies

If there is no strong lineage dependency between tasks (for example, the task only reads the latest partition data), or the dependent data is not produced by a scheduled node (for example, locally uploaded data), customize the node dependencies:

Depend on the workspace root node

For example, when the upstream data in a synchronization task comes from other business databases, or an SQL-type task processes table data produced by a real-time synchronization task, you can directly mount the dependency to the workspace root node.
Depend on a virtual node

Use virtual nodes to manage complex business processes. Mount related nodes as dependencies of a virtual node to control overall scheduling time or freeze a business process.

Node output parameters

Define output parameters in an upstream node and reference them as input parameters in downstream nodes to pass values between tasks.

Notes

Node Output Parameters can only be used as input parameters for downstream nodes. To use one, add a parameter in the Scheduling Parameters section of the downstream node, and click the icon in the Actions column to associate it with the upstream parameter. Some nodes cannot directly pass query results from upstream to downstream. If you need to pass the query results of an upstream node to a downstream node, use an assignment node. For more information, see Use assignment nodes.
The following nodes support output parameters: EMR Hive, EMR Spark SQL, ODPS Script, Hologres SQL, AnalyticDB for PostgreSQL, and MySQL nodes.

Configure node output parameters

The values of Node Output Parameters can be of two types: Constant and Variable.

After you define the output parameters and submit the current node, you can Associate Upstream Node Output Parameters as input parameters when configuring scheduling parameters for the downstream node.

Parameter name: The name of the defined output parameter.
Parameter value: The value of the output parameter. The value types include constant and variable:
- A constant is a fixed string.
- A variable can be a system-supported global variable, a built-in scheduling parameter, or a custom parameter.

Configure the associated role for a node

The associated role feature lets you specify a RAM role for a task node. At runtime, the task dynamically obtains temporary STS credentials for the role, allowing your code to access other cloud resources without permanent AccessKey pairs.

Important

Resource group restriction: Only nodes running on serverless resource groups are supported.
Node type restriction: Only Python, Shell, Notebook, PyODPS 2, and PyODPS 3 nodes are supported.

Step 1: Configure the associated role for a DataWorks node

On the right side of the node editor page, find and click Run Configuration.
In the Run Configuration panel, switch to the Associated Role tab.
In the RAM Role drop-down list, select the RAM role you have prepared.

Important
If the drop-down list is empty or you cannot find the desired role, see Configure a RAM role to complete the RAM role configuration.
After the configuration is complete, submit the node. This configuration takes effect only for debug runs.

Step 2: Obtain and use temporary credentials in code

After you configure the associated role, DataWorks injects the obtained temporary credentials into the runtime environment when the task runs. You can obtain them in your code in the following two ways.

Method 1: Read environment variables (recommended for Shell and Python)

The system automatically sets the following three environment variables, which you can read directly in your code.

LINKED_ROLE_ACCESS_KEY_ID: The temporary AccessKey ID.
LINKED_ROLE_ACCESS_KEY_SECRET: The temporary AccessKey secret.
LINKED_ROLE_SECURITY_TOKEN: The temporary security token.

Code example (Python):

Important

This example requires a custom Python image with oss2 installed. For more information, see Use custom images.

import os
import oss2

# 1. Get temporary credentials from environment variables
access_key_id = os.environ.get('LINKED_ROLE_ACCESS_KEY_ID')
access_key_secret = os.environ.get('LINKED_ROLE_ACCESS_KEY_SECRET')
security_token = os.environ.get('LINKED_ROLE_SECURITY_TOKEN')

# Check whether the credentials are obtained
if not all([access_key_id, access_key_secret, security_token]):
    raise Exception("Failed to get linked role credentials from environment variables.")

# 2. Use temporary credentials to initialize the OSS client
# Assume that you have granted the role access to 'your-bucket-name'.
# Replace <regionID> and 'your-bucket-name' with actual values.
auth = oss2.StsAuth(access_key_id, access_key_secret, security_token)
bucket = oss2.Bucket(auth, 'http://oss-<regionID>-internal.aliyuncs.com', 'your-bucket-name')

# 3. Use the client to access OSS resources
try:
    # List objects in the bucket
    for obj in oss2.ObjectIterator(bucket):
        print('object name: ' + obj.key)
    print("Successfully accessed OSS with linked role.")
except oss2.exceptions.OssError as e:
    # Log the error and re-raise it so that the task instance is marked as failed
    print(f"Error accessing OSS: {e}")
    raise

Code example (Shell):

#!/bin/bash
access_key_id=${LINKED_ROLE_ACCESS_KEY_ID}
access_key_secret=${LINKED_ROLE_ACCESS_KEY_SECRET}
security_token=${LINKED_ROLE_SECURITY_TOKEN}

# Access OSS. Replace <regionID>, <bucket_name>, and <file_name> with actual values.
echo "ID: ${access_key_id}"
echo "token: ${security_token}"
ls -al /home/admin/usertools/tools/

# This example uses ossutil to download a file from the specified OSS path to the local file test_dw.py, and then prints the file content.
# The arguments are quoted so that the shell does not treat < and > in the placeholders as redirections.
/home/admin/usertools/tools/ossutil64 cp --access-key-id "$access_key_id" --access-key-secret "$access_key_secret" --sts-token "$security_token" --endpoint "http://oss-<regionID>-internal.aliyuncs.com" "oss://<bucket_name>/<file_name>" test_dw.py
echo "File downloaded successfully. Printing the content:"
cat test_dw.py

Method 2: Use the Credentials Client (recommended for Python)

Code example (Python):

Important

This example requires a custom Python image with oss2 and alibabacloud_credentials installed. For more information, see Use custom images.

from alibabacloud_credentials.client import Client as CredentialClient
import oss2

# 1. Use the SDK to automatically obtain credentials
# It automatically looks for LINKED_ROLE_* credential information in environment variables
cred_client = CredentialClient()
credential = cred_client.get_credential()

access_key_id = credential.get_access_key_id()
access_key_secret = credential.get_access_key_secret()
security_token = credential.get_security_token()

if not all([access_key_id, access_key_secret, security_token]):
    raise Exception("Failed to get linked role credentials via SDK.")

# 2. Use the credentials to initialize the OSS client
# Replace the endpoint and 'your-bucket-name' with the values for your region and bucket.
auth = oss2.StsAuth(access_key_id, access_key_secret, security_token)
bucket = oss2.Bucket(auth, 'http://oss-cn-hangzhou.aliyuncs.com', 'your-bucket-name')

# 3. Access OSS
print("Listing objects in bucket...")
for obj in oss2.ObjectIterator(bucket):
    print(' - ' + obj.key)
print("Successfully accessed OSS with linked role via SDK.")

Step 3: Run and verify

Important

Shell, Python: At runtime, the task uses the specified RAM role to access other Alibaba Cloud services.
PyODPS: When accessing other Alibaba Cloud services (such as OSS), the task uses the RAM role you configured. However, when accessing MaxCompute data, the task still uses the access identity configured for the compute resource (at the project level).

Configure scheduling properties

After debugging the node, synchronize the Associated Role from Run Configuration to Associated Role > RAM Role in Schedule Settings. After deployment, the task runs with the identity of the specified role.

If you configured a custom image in Run Configuration, you must also synchronize the settings to the scheduling configuration.

View the execution role in Operation Center

After the task finishes running, view the details of the task instance in Operation Center to confirm whether the specified role was successfully used.

Go to Operation Center > Scheduled Task Operations > Scheduled Instances.
Find the node instance you ran and click it to open the details page.
In the Properties section of the instance details, check the Execution Identity field. This field displays the ARN of the associated role actually used for the run.

An ARN is a unique resource identifier. For more information, see ARN.
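
If you want to compare the displayed Execution Identity with the role you expected, a small helper can split the ARN into its parts. The Python sketch below is hypothetical and rests on two assumptions that you should confirm against the value shown in your own console: that a RAM role ARN has the form acs:ram::<account-id>:role/<role-name>, and that role names can be compared case-insensitively. The account ID and role name in the example are made up.

```python
def parse_role_arn(arn: str) -> tuple[str, str]:
    """Split a RAM role ARN into (account_id, role_name).

    Assumed form: acs:ram::<account-id>:role/<role-name>
    """
    parts = arn.split(":")
    if len(parts) != 5 or parts[0] != "acs" or parts[1] != "ram":
        raise ValueError(f"not a RAM ARN: {arn!r}")
    account_id, resource = parts[3], parts[4]
    prefix = "role/"
    if not resource.startswith(prefix) or len(resource) == len(prefix):
        raise ValueError(f"not a role ARN: {arn!r}")
    return account_id, resource[len(prefix):]


def uses_expected_role(arn: str, expected_role_name: str) -> bool:
    """Check whether the ARN refers to the expected role (case-insensitive comparison)."""
    _, role_name = parse_role_arn(arn)
    return role_name.lower() == expected_role_name.lower()


# Made-up example value, for illustration only.
example_arn = "acs:ram::1234567890123456:role/dataworks-oss-reader"
print(parse_role_arn(example_arn))
print(uses_expected_role(example_arn, "DataWorks-OSS-Reader"))
```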

References

Scheduling parameter reference: Sources and expressions of scheduling parameters.
Scheduling policy references:
- Immediately generate instances after deployment.
- Dry run.
Scheduling time reference: Scheduling time reference.
Scheduling dependency references:
Node output parameter reference: Node output parameters.
Other references: Other scheduling references.