View a training job's basic information, configuration, events, resources, and logs to monitor execution and status. You can search by job name or ID to quickly switch between running and historical instances.
Basic job information and configuration
- Log on to the PAI console, select a region and workspace, and click Enter DLC.
- Click a job name to open the overview page.
- On the Overview tab, view the job's basic, environment, and resource information. Key details also appear at the top of the page.

Job events
Event logs track job scheduling and resource allocation progress. Use events to identify and troubleshoot issues.
- To view job-level events, go to the Event tab.
- To view node-level events, go to the Instance section on the Overview tab, find an instance, and click Log in the Actions column. Then go to the System Log tab to view the node event details.

Resource view
The resource view displays GPU utilization, GPU memory usage, CPU utilization, memory usage, and network I/O metrics. Use it to monitor real-time resource consumption and plan optimizations.
To open the resource view, go to the Monitoring tab.

Training jobs created with a resource quota support additional monitoring features:
- Metrics are available at the Job Dimension, Pod Dimension, and GPU Dimension levels.
- Filter by time range and metric type. Click More to select which metrics to display and change their order for a personalized monitoring dashboard.
- DLC also provides monitoring and alerting for training job resources. For more information, see Training monitoring and alerting.
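
When reviewing GPU utilization for optimization, it can help to look for sustained low-utilization periods rather than single dips. The following is a minimal sketch that assumes you have already collected utilization samples by some means; the sample format, the 30% threshold, and the function name `low_utilization_windows` are illustrative assumptions and not part of DLC.

```python
from typing import List, Tuple

# Hypothetical sample format: (seconds since job start, GPU utilization in percent).
Sample = Tuple[int, float]


def low_utilization_windows(
    samples: List[Sample], threshold: float = 30.0, min_points: int = 3
) -> List[Tuple[int, int]]:
    """Return (start, end) timestamps of runs in which utilization stays
    below `threshold` for at least `min_points` consecutive samples."""
    windows: List[Tuple[int, int]] = []
    run: List[int] = []
    for ts, util in samples:
        if util < threshold:
            run.append(ts)
        else:
            # The low-utilization run ended; keep it only if it was long enough.
            if len(run) >= min_points:
                windows.append((run[0], run[-1]))
            run = []
    # Handle a run that extends to the end of the samples.
    if len(run) >= min_points:
        windows.append((run[0], run[-1]))
    return windows


# Invented sample values, for illustration only.
samples = [(0, 85.0), (60, 20.0), (120, 10.0), (180, 15.0), (240, 90.0), (300, 25.0)]
print(low_utilization_windows(samples))  # [(60, 180)]
```

Requiring several consecutive samples avoids flagging brief dips, such as those that can occur between training steps.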
Job logs
To troubleshoot errors or review execution history, view job logs using either of the following methods:
- On the Overview tab, go to the Instance section. Find an instance and click Log in the Actions column to view the output log of a specific node.
- Go to the Log tab to search logs by keyword. For more information, see Query aggregated logs by keyword.
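
If you have a copy of a node's output log as plain text, the same kind of keyword search can be approximated offline. The sketch below is a generic illustration and does not reproduce how the Log tab works; the function name `search_log` and the sample log lines are invented for this example.

```python
import re
from typing import Iterable, List, Tuple


def search_log(
    lines: Iterable[str], keyword: str, ignore_case: bool = True
) -> List[Tuple[int, str]]:
    """Return (line number, line text) for every line containing `keyword`.

    The keyword is matched literally, not as a regular expression.
    """
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.compile(re.escape(keyword), flags)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


# Invented log lines, for illustration only.
sample_lines = [
    "step 100 loss=0.52\n",
    "RuntimeError: CUDA out of memory\n",
    "worker-1 restarting after error\n",
]

for number, text in search_log(sample_lines, "error"):
    print(f"{number}: {text}")
```

To search a saved file instead of an in-memory list, pass an open file object, for example `search_log(open("job.log", encoding="utf-8"), "error")`, where `job.log` is a hypothetical local path.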

Audit logs
PAI integrates with ActionTrail to record DLC action events for your Alibaba Cloud account. Events from the last 90 days are available for viewing and searching. For more information, see ActionTrail.
Restart records
If you enabled Auto Fault Tolerance or Health Check (blocklist and rerun) when creating the job, click the Restart Count value to go to the restart records page. This page displays restart count, restart time, restart reason, restart result, and restart duration.
- In the restart records list, click Error Details to view detailed information about a specific restart, including restart count, restart time, node name, instance name, error code, error message, and error source.
- Click View Aggregated Error Details to expand the full list of restart records.
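
When a job restarts many times, grouping the restart records by error code or node name can show whether the failures share a cause. The sketch below assumes you have transcribed the records into dictionaries by hand; the key names, the error codes, and the function name `count_by` are hypothetical and only mirror the fields listed above.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_by(records: List[Dict[str, str]], field: str) -> List[Tuple[str, int]]:
    """Count restart records per value of `field`, most frequent first."""
    return Counter(record[field] for record in records).most_common()


# Invented records, for illustration only; the error codes are not real DLC codes.
records = [
    {"node_name": "node-a", "error_code": "HYPOTHETICAL_OOM", "error_source": "user"},
    {"node_name": "node-b", "error_code": "HYPOTHETICAL_NIC_DOWN", "error_source": "system"},
    {"node_name": "node-a", "error_code": "HYPOTHETICAL_OOM", "error_source": "user"},
]

print(count_by(records, "error_code"))
print(count_by(records, "node_name"))
```

A single node that dominates the counts may point to a hardware problem, whereas one error code spread across many nodes more likely points to the job itself.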
Related topics
Manage training jobs based on job status. For more information, see Manage training jobs.