When the test data or hyperparameters of a training job are updated and continuous incremental training or model fine-tuning is required, you can use the periodic scheduling feature of DataWorks to automatically submit Deep Learning Containers (DLC) jobs at specific points in time. This topic describes how to submit a DLC job at a scheduled time.
Background information
You can use one of the following methods to configure periodic scheduling for a DLC job:
Method 1: Load a DLC job on a PAI DLC node in the DataWorks console and configure job scheduling
Method 2: Create a script task and configure job scheduling
Prerequisites
The permissions that are required to use DLC are obtained. For more information, see Grant the permissions that are required to use DLC.
DataWorks is authorized to access PAI.
You can complete the authorization with one click on the authorization page. For more information about the service-linked role that is created based on the authorization, see Role 1: AliyunServiceRoleForDataworksEngine. Only an Alibaba Cloud account or a RAM user to which the AliyunDataWorksFullAccess policy is attached can perform one-click authorization.
A workflow is created.
In DataStudio, development operations on different compute engines are organized by workflow. Therefore, you must create a workflow before you can create a node. For more information, see Create a workflow.
Precautions
Each time a PAI DLC node is run, a new DLC task is generated on the DLC platform of PAI. To prevent multiple tasks that have the same name from being generated in PAI when you use DataWorks to periodically schedule PAI DLC nodes, we recommend that you configure an appropriate scheduling cycle based on your business requirements when you develop DLC tasks in DataWorks. We also recommend that you add a datetime variable to the task name and assign a time-based scheduling parameter to the variable as a value. This way, you can add a date and time to the task name. For more information, see the Step 2: Develop a PAI DLC task section in this topic.
You cannot use the shared resource group for scheduling to run PAI DLC tasks.
The operations described in this topic are performed in the China (Shanghai) region. You can perform operations in other regions based on the instructions displayed in the DataWorks console.
Method 1: Load a DLC job on a PAI DLC node in the DataWorks console and configure job scheduling
Step 1: Create a DLC job
Log on to the Platform for AI (PAI) console. Go to the Distributed Training Jobs page, and create a DLC job. In this topic, a PyTorch-based DLC job is used as an example. For information about how to create a PyTorch-based DLC job, see Submit a standalone training job that uses PyTorch.
Step 2: Create a PAI DLC node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
On the DataStudio page, find the desired workflow, right-click the workflow name, and then choose .
In the Create Node dialog box, configure the Name parameter and click Confirm. Then, you can use the node to develop tasks and configure task scheduling properties.
On the DLC node tab, search for the DLC job that you created by name in the Load PAI DLC Job section and load the job.
After you load the job, the DLC node editor generates node code based on the configurations of the task in PAI. You can modify task configurations based on the code. For more information, see the "Step 2: Develop a PAI DLC task" section in the Create and use a PAI DLC node topic.
Step 3: Configure job scheduling
In the right-side pane of the node tab, click Properties. In the Properties panel, you can view configuration items such as General properties, Scheduling Parameter, Schedule, Resource Group, and Dependencies. Configure the parameters in the Schedule section. DataWorks automatically schedules and runs node tasks based on the specified scheduling cycle. For more information, see Overview.
Before you commit the node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.
To prevent multiple tasks that have the same name from being generated in PAI when you use DataWorks to periodically schedule PAI DLC nodes, we recommend that you specify an appropriate scheduling cycle based on your business requirements and include a time-based variable in the task name, as shown in the sketch below.
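The following sketch illustrates the naming pattern that the Precautions section recommends. The field name (name) and the variable name (bizdate) are illustrative assumptions; adapt them to the code that the DLC node editor generated for your job and to the Scheduling Parameter section of the Properties panel.
# In the DLC node code, reference a variable in the task name (illustrative field name):
name=dlc_training_job_${bizdate}
# In the Scheduling Parameter section of the Properties panel, assign a time-based value to the variable
# (bizdate with the built-in $bizdate value is a common choice):
bizdate=$bizdate
# At run time, DataWorks replaces ${bizdate} with the data timestamp of the scheduling cycle,
# so each run submits a task with a unique name, for example dlc_training_job_20240101.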
Step 4: Debug the task code
To check whether the node is configured as expected, perform the following operations.
Optional. Select a resource group and assign custom parameters to variables.
Click the icon in the top toolbar of the configuration tab of the node. In the Parameters dialog box, select the resource group for scheduling that you want to use to debug and run the task code. If you use scheduling parameters in the task code, you can assign values to the variables for debugging. For more information about the value assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?
Save and run the task code.
In the top toolbar, click the icon to save the task code. Then, click the icon to run the task code.
Optional. Perform smoke testing.
You can perform smoke testing on the task in the development environment to check whether the task is run as expected when you commit the task or after you commit the task. For more information, see Perform smoke testing.
Step 5: Submit the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the icon in the top toolbar to save the task. Then, click the icon in the top toolbar to commit the task.
In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the task code after it is committed, based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
Step 6: View operations logs
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered nodes.
Method 2: Create a script task and configure job scheduling
Step 1: Create an exclusive resource group for scheduling
Create an exclusive resource group for scheduling in the DataWorks console. For more information, see Create an exclusive resource group for Data Integration.
Step 2: Associate the exclusive resource group with a workspace
Associate the exclusive resource group with a workspace. This way, you can select the resource group in the workspace when you submit a job. For more information, see Step 2: Associate the exclusive resource group for scheduling with a workspace.
Step 3: Install the DLC package
To install the package, contact technical support to obtain the required permissions.
Create a command
Log on to the DataWorks console. In the left-side navigation pane, click Resource Group. The Exclusive Resource Groups tab appears on the Resource Groups page.
Find the exclusive resource group for scheduling that you created. Click the icon in the Actions column and then click O&M Assistant. On the O&M Assistant page, click Create Command, configure the following key parameters, and then click OK.
Parameter
Description
Command Type
The type of the command. Select Manual Installation.
Command Content
The content of the command. Enter the following content:
wget -P /home/admin/usertools/tools/ https://dlc-release.oss-cn-zhangjiakou.aliyuncs.com/console/public/latest/dlc --no-check-certificate
chmod +x /home/admin/usertools/tools/dlc
Installation Directories
The directory used for installation. Save the command to the /home/admin/usertools/tools/ directory.
Timeout
The timeout period of the command. Unit: seconds. If the command times out, the system forcibly stops the command. We recommend that you set the parameter to 60.
Run a command.
On the O&M Assistant page, find the command and click Run command in the Actions column.

In the Run command panel, click Run.
View the command execution results.
On the O&M Assistant page, find the command and click View Result in the Actions column.

In the Command Execution Result dialog box, view the command execution result. If the execution progress is 100%, the DLC package is installed.
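To confirm that the binary is usable on the resource group before you continue, you can create and run one more command in the O&M Assistant. The following check is a minimal sketch that assumes the DLC command-line tool prints its usage information when it is called with --help; adjust the check if your version behaves differently.
# Print the usage information of the installed DLC command-line tool.
# A usage message in the command execution result indicates that the binary was downloaded and is executable.
/home/admin/usertools/tools/dlc --help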

Step 4: Create a workflow
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Move the pointer over the icon and choose the option for creating a node. In the Create Node dialog box, configure the Name and Path parameters. Click Confirm to create the Shell node.
Step 5: Submit a job for testing
To configure automatic job submission at specific points in time, a job node is required. Before you submit a job, create an initial job node and run a smoke test on the node. If an initial node is available, go to Step 6.
Modify the deployment script.
On the configuration tab of the workflow, double-click the Shell node that you created. In this example, the node is named Deployment.
On the Shell node tab, enter the following commands:
# Generate a job description file.
cat << EOF > jobfile
name=dataworks-job
workers=1
worker_spec=ecs.g6.large
worker_image=registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04
command=echo $(date)
EOF
# Submit the job.
/home/admin/usertools/tools/dlc submit pytorchjob \
  --access_id=<access_id> \
  --access_key=<access_key> \
  --endpoint=pai-dlc.cn-hangzhou.aliyuncs.com \
  --region=cn-hangzhou \
  --job_file=./jobfile \
  --interactive
jobfile indicates the job description file. For more information about the job configuration, see Commands used to submit jobs. You must configure the endpoint parameter based on the region where you want to deploy the job.
The following list shows the endpoint of each supported region:
China (Shanghai): pai-dlc.cn-shanghai.aliyuncs.com
China (Beijing): pai-dlc.cn-beijing.aliyuncs.com
China (Hangzhou): pai-dlc.cn-hangzhou.aliyuncs.com
China (Shenzhen): pai-dlc.cn-shenzhen.aliyuncs.com
China (Hong Kong): pai-dlc.cn-hongkong.aliyuncs.com
Singapore: pai-dlc.ap-southeast-1.aliyuncs.com
Malaysia (Kuala Lumpur): pai-dlc.ap-southeast-3.aliyuncs.com
Germany (Frankfurt): pai-dlc.eu-central-1.aliyuncs.com
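As noted in the Precautions section, you may want each scheduled run to submit a job with a unique, time-stamped name. The following variant of the deployment script is a minimal sketch of that idea. It assumes that a scheduling parameter such as $bizdate is configured for the Shell node and is passed to the script as the first positional argument ($1); the job name pattern is illustrative.
# Read the data timestamp that DataWorks passes as the first script argument.
bizdate=$1
# Generate a job description file whose job name changes with every scheduling cycle.
cat << EOF > jobfile
name=dataworks-job-${bizdate}
workers=1
worker_spec=ecs.g6.large
worker_image=registry-vpc.cn-hangzhou.aliyuncs.com/pai-dlc/pytorch-training:1.7.1-gpu-py37-cu110-ubuntu18.04
command=echo $(date)
EOF
# Submit the job in the same way as in the original script.
/home/admin/usertools/tools/dlc submit pytorchjob \
  --access_id=<access_id> \
  --access_key=<access_key> \
  --endpoint=pai-dlc.cn-hangzhou.aliyuncs.com \
  --region=cn-hangzhou \
  --job_file=./jobfile \
  --interactive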
Run the script.
In the upper part of the Shell node tab, click the icon. In the Warning message, click Continue to Run.
In the Runtime Parameters dialog box, set the Resource Group parameter to the created exclusive resource group.
Then, click OK.
After the script is run, a job is generated. You can go to the DLC page of the default workspace to view the generated job.
Step 6: Perform job scheduling
Run the scheduling job.
In the right-side pane of the Shell node tab, click the Properties tab.
In the Schedule section of the Properties page, configure the Scheduling Cycle and Rerun parameters.
In the Dependencies section, click Use Root Node next to the Parent Nodes field.
Configure dependencies. For more information, see Configure same-cycle scheduling dependencies.
On the Shell node tab, click the icon to save the configurations. Then, click the icon to commit the scheduled node.
View the instances of the scheduling node.
In the upper-right corner of the Shell node tab, click Operation Center.
On the Operation Center page, choose .
On the instance list page, view the scheduled time for automatic job submission in the Schedule column.
In the Actions column, choose to view the operations logs of each scheduled job submission.
References
You can view and manage DLC jobs that are submitted in the PAI console.