Platform For AI: Monitor training jobs and configure alerts

Last Updated: Apr 02, 2026

DLC tracks real-time resource usage of training jobs and sends alert notifications when a metric, such as GPU utilization, exceeds a threshold. You can use CloudMonitor or ARMS to view monitoring data, configure alerts, and subscribe to metrics.

Prerequisites

At least one DLC training job is created. For more information, see Create a training job.

Limitations

Monitoring is unavailable for pay-as-you-go training jobs that use general-purpose computing resources.

Accounts and permissions

  • Alibaba Cloud account: Performs all operations without additional authorization.

  • RAM user:

    • To view monitoring data for DLC jobs in a workspace, grant the RAM user these permissions:

      • Add the RAM user as a workspace member with the Administrator, Algorithm Developer, or Algorithm O&M Engineer role. For more information, see Manage workspace members.

      • Grant the RAM user read-only access to CloudMonitor (AliyunCloudMonitorReadOnlyAccess). For more information, see Manage RAM user permissions.

    • To view monitoring data and configure alerts for DLC jobs in a workspace, grant the RAM user these permissions:

      • Add the RAM user as a workspace member with the Administrator, Algorithm Developer, or Algorithm O&M Engineer role. For more information, see Manage workspace members.

      • Grant the RAM user administrative access to CloudMonitor (AliyunCloudMonitorFullAccess). For more information, see Manage RAM user permissions.
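
The permission requirements above can be summarized in a short check. This is a minimal sketch, not an Alibaba Cloud API: the function names and arguments are hypothetical, and only the role names and policy names come from the list above. It also assumes that AliyunCloudMonitorFullAccess is sufficient for read-only viewing.

```python
from typing import Iterable

# Workspace roles and policy names are taken from the list above.
ALLOWED_ROLES = {"Administrator", "Algorithm Developer", "Algorithm O&M Engineer"}
READ_ONLY = "AliyunCloudMonitorReadOnlyAccess"
FULL_ACCESS = "AliyunCloudMonitorFullAccess"


def required_cloudmonitor_policy(configure_alerts: bool) -> str:
    """Return the CloudMonitor policy a RAM user needs (hypothetical helper)."""
    return FULL_ACCESS if configure_alerts else READ_ONLY


def can_monitor(role: str, policies: Iterable[str], configure_alerts: bool = False) -> bool:
    """Check a RAM user's workspace role and attached policies (hypothetical helper)."""
    attached = set(policies)
    if role not in ALLOWED_ROLES:
        return False
    if FULL_ACCESS in attached:
        return True  # assumption: full access also covers viewing
    return not configure_alerts and READ_ONLY in attached
```

For example, `can_monitor("Algorithm Developer", {"AliyunCloudMonitorReadOnlyAccess"})` is true, while the same call with `configure_alerts=True` is false.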

Monitoring metrics

Monitoring metrics include GPU, CPU, memory, disk, network, RDMA, and CPFS metrics. Supported dimensions include job, pod (worker), and individual GPU card. The following tables list typical health metrics. For a complete list and detailed descriptions, see Metrics for Deep Learning Containers (DLC).

Job dimension

| Metric | Description |
| --- | --- |
| CPU utilization (job dimension) | CPU utilization, as a percentage. |
| Memory utilization (job dimension) | Memory utilization, as a percentage. |
| Disk read rate (job dimension) | Disk read rate, in MiB/s. |
| Disk write rate (job dimension) | Disk write rate, in MiB/s. |
| Network receive rate (job dimension) | Network receive rate, in MiB/s. |
| Network send rate (job dimension) | Network send rate, in MiB/s. |
| GPU compute utilization (job dimension) | GPU compute utilization, as a percentage. |
| GPU memory utilization (job dimension) | GPU memory utilization, as a percentage. |
| GPU SM utilization (job dimension) | GPU Streaming Multiprocessor (SM) utilization, as a percentage. |
| GPU power consumption (job dimension) | GPU power consumption, in watts. |
| GPU temperature (job dimension) | GPU temperature, in degrees Celsius. |
| Overall GPU health (job dimension) | Overall GPU health. 100% = all GPUs healthy; less than 100% = one or more GPUs abnormal. |
| RDMA receive rate (job dimension) | RDMA receive rate. |
| RDMA send rate (job dimension) | RDMA send rate. |
| CPFS write rate (job dimension) | CPFS write rate, in MiB/s. |
| CPFS read rate (job dimension) | CPFS read rate, in MiB/s. |
| NVLink receive volume (job dimension) | Data volume received over NVLink. |
| NVLink send volume (job dimension) | Data volume sent over NVLink. |
| PCIe receive volume (job dimension) | Data volume received over PCIe. |
| PCIe send volume (job dimension) | Data volume sent over PCIe. |

For more metrics, see Metrics for Deep Learning Containers (DLC).

Pod (worker) dimension

| Metric | Description |
| --- | --- |
| CPU utilization (pod dimension) | CPU utilization, as a percentage. |
| Memory utilization (pod dimension) | Memory utilization, as a percentage. |
| Disk read rate (pod dimension) | Disk read rate, in MiB/s. |
| Disk write rate (pod dimension) | Disk write rate, in MiB/s. |
| Network receive rate (pod dimension) | Network receive rate, in MiB/s. |
| Network send rate (pod dimension) | Network send rate, in MiB/s. |
| GPU compute utilization (pod dimension) | GPU compute utilization, as a percentage. |
| GPU memory utilization (pod dimension) | GPU memory utilization, as a percentage. |
| GPU SM utilization (pod dimension) | GPU SM utilization, as a percentage. |
| GPU power consumption (pod dimension) | GPU power consumption, in watts. |
| GPU temperature (pod dimension) | GPU temperature, in degrees Celsius. |
| Overall GPU health (pod dimension) | Overall GPU health. 100% = all GPUs healthy; less than 100% = one or more GPUs abnormal. |
| RDMA receive rate (pod dimension) | RDMA receive rate, in MiB/s. |
| RDMA send rate (pod dimension) | RDMA send rate, in MiB/s. |
| CPFS read rate (pod dimension) | CPFS read rate, in MiB/s. |
| CPFS write rate (pod dimension) | CPFS write rate, in MiB/s. |
| NVLink receive volume (pod dimension) | Data volume received over NVLink. |
| NVLink send volume (pod dimension) | Data volume sent over NVLink. |
| PCIe receive volume (pod dimension) | Data volume received over PCIe. |
| PCIe send volume (pod dimension) | Data volume sent over PCIe. |

For more metrics, see Metrics for Deep Learning Containers (DLC).

GPU card dimension

| Metric | Description |
| --- | --- |
| GPU memory interface utilization (card dimension) | GPU memory interface utilization per card. |
| GPU SM utilization (card dimension) | GPU SM utilization per card. |
| GPU power consumption (card dimension) | GPU power consumption per card, in watts. |
| GPU temperature (card dimension) | GPU temperature per card, in degrees Celsius. |
| Overall GPU health (card dimension) | Overall GPU card health. 100% = card healthy; less than 100% = card abnormal. |

For more metrics, see Metrics for Deep Learning Containers (DLC).
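
The Overall GPU health metric reads the same way in all three dimensions: 100% means healthy, and anything lower indicates a problem. The following minimal sketch shows how a script consuming these values might interpret a reading. The function name and the returned strings are hypothetical and are not part of DLC or CloudMonitor.

```python
def describe_gpu_health(health_percent: float, dimension: str = "job") -> str:
    """Interpret an 'Overall GPU health' reading (hypothetical helper)."""
    if not 0 <= health_percent <= 100:
        raise ValueError("health must be a percentage between 0 and 100")
    if health_percent == 100:
        return "healthy"
    # Below 100%: at job or pod level, one or more GPUs are abnormal;
    # at card level, the card itself is abnormal.
    if dimension == "card":
        return "card abnormal"
    return "one or more GPUs abnormal"
```

A reading such as `describe_gpu_health(87.5)` at the job level points to at least one abnormal GPU; switching to the pod and then the GPU card dimension narrows down which one.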

View monitoring charts

  1. On the job details page, go to the Monitoring tab to view the job's monitoring data. Note: Job monitoring data is retained for up to 30 days.

  2. The Job Level, Instance Dimension, and GPU Level tabs show metrics for GPU, CPU, memory, network, and disk.

  3. Click More to select key metrics and drag them to adjust display priority.

  4. Zoom in on a region, undo a zoom, reset the view, or download the chart.

  5. Chart sync: Enable this feature to synchronize zoom actions across all charts for easier comparison.

  6. Customize the number of charts displayed per row.

Use CloudMonitor

CloudMonitor monitors Alibaba Cloud resources and internet applications. Use the CloudMonitor console to view monitoring data for DLC jobs and configure alert notifications. CloudMonitor also provides APIs for subscribing to metric data to build custom monitoring systems and dashboards. For more information, see What is CloudMonitor?.

Billing

CloudMonitor incurs fees. For billing details, see Billing of CloudMonitor.

View monitoring data

  1. Log on to the CloudMonitor console.

  2. In the left-side navigation pane, choose Visualization > Cloud Service Monitoring Dashboard.

  3. On the Cloud Service Monitoring Dashboard page, select PAI-Deep Learning Containers (DLC) and then select or search for a Workspace ID to view the corresponding monitoring charts. To find a workspace ID, see Manage workspaces.

    On the monitoring charts, perform the following operations:

    • Switch monitoring dimensions: The system displays monitoring metrics by job, pod (worker), and GPU dimension.

      • Click the Job dimension tab. Select or enter a DLC job ID to view the monitoring data for a single job.

      • Click the Pod dimension tab. Select or enter a pod ID to view the monitoring data for a single pod.

      • Click the GPU Level tab. Select or enter a pod ID to view GPU-level monitoring data for a single pod.

    • Change the time range: Select a time range for the monitoring charts.

    • Zoom in: Click the zoom-in icon in the upper-right corner of each chart to view detailed monitoring data.

Configure alerts

Configure alert rules to monitor DLC job resource levels. When a metric violates a rule, the system sends a notification. Configure alerts through the CloudMonitor console or APIs.

Configure alert contacts

  1. Log on to the CloudMonitor console.

  2. In the left-side navigation pane, choose Alert Service > Alert Contacts.

  3. On the Alert Contacts tab, click Create Alert Contact. Enter the contact's name, phone number, email address, or webhook URL, then click OK.

  4. On the Alert Contact Group tab, click Create Alert Contact Group. Enter a group name, add existing alert contacts to the group, then click OK.

Configure alert rules

  1. In the left-side navigation pane of the CloudMonitor console, choose Cloud Service Monitoring.

  2. On the Cloud Service Monitoring page, search for and select PAI-Deep Learning Containers (DLC).

  3. On the PAI-Deep Learning Containers (DLC) page, select the region where your service is located and click Create Alert Rule.

  4. In the Create Alert Rule panel, configure the following parameters and click OK.

    | Parameter | Description |
    | --- | --- |
    | Product | Select PAI-Deep Learning Containers (DLC). |
    | Resource scope | Scope of the alert rule: All Resources or Instance. All Resources: An alert is sent if any DLC resource meets the alert rule. Instance: In the Associate Resources section, add workspaces. An alert is sent only when a DLC job in those workspaces meets the alert rule. |
    | Rule | Conditions that trigger the alert. For more information, see Create an alert rule. |
    | Mute for | Interval at which alert notifications are resent if the alert is not resolved. |
    | Effective period | Time period during which the alert rule is active. |
    | Tags | Custom tag for the alert rule, consisting of a key-value pair. |
    | Contact group | Contact group that receives alert notifications. |

  5. On the PAI-Deep Learning Containers (DLC) page, click Alert Rules to view the details and history of alert rules.

To configure alert rules programmatically, use CloudMonitor APIs for alert history, templates, rules, and contacts. For details, see CloudMonitor API reference: Alert Service.
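
To illustrate how the Rule and Mute for parameters interact, the following sketch simulates a threshold rule over a series of samples. It is a simplified model for illustration only, not CloudMonitor's implementation. The class and function names are hypothetical, and the sketch assumes that a notification is sent on the first violation, resent only after the mute interval while the violation persists, and that a resolved alert resets the interval.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    threshold: float   # for example, a GPU utilization percentage
    mute_seconds: int  # "Mute for": minimum gap between repeated notifications


def notifications(rule: AlertRule, samples: List[Tuple[int, float]]) -> List[int]:
    """Return the timestamps (in seconds) at which a notification would be sent.

    samples is a time-ordered list of (timestamp, metric_value) pairs.
    """
    sent: List[int] = []
    last_sent: Optional[int] = None
    for ts, value in samples:
        if value <= rule.threshold:
            # Alert resolved (assumption): the next violation notifies immediately.
            last_sent = None
            continue
        if last_sent is None or ts - last_sent >= rule.mute_seconds:
            sent.append(ts)
            last_sent = ts
    return sent


rule = AlertRule(threshold=90, mute_seconds=600)
samples = [(0, 95), (300, 96), (600, 97), (900, 50), (1200, 99)]
print(notifications(rule, samples))
```

In this run, the sample at 300 seconds is suppressed because it falls inside the mute interval, and the drop to 50 at 900 seconds resolves the alert, so the violation at 1200 seconds notifies again.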

Subscribe to monitoring metrics

CloudMonitor provides APIs for subscribing to DLC monitoring metrics to build custom monitoring systems and dashboards. For the procedure, see Cloud Service Monitoring API directory.

| CloudMonitor API | Overview |
| --- | --- |
| DescribeMetricLast | Queries the latest monitoring data of a metric. |
| DescribeMetricList | Queries monitoring data of a metric for a cloud service. |
| DescribeMetricData | Queries monitoring data of a metric for a cloud service. |
| DescribeMetricMetaList | Queries details of metrics available in CloudMonitor. |
| DescribeProjectMeta | Queries cloud services that support time series metrics in CloudMonitor. |
| DescribeMetricTop | Queries the latest monitoring data of a metric for a cloud service, sorted by value. |

The following example uses the DescribeMetricList API to query monitoring data for a DLC metric.

  1. Go to the Metrics for Deep Learning Containers (DLC) page.

  2. In the metrics list, find the target metric and click Get Metric Data in the Actions column.

  3. On the OpenAPI Explorer page, configure the following key parameters and leave the others at their default values. For more information, see DescribeMetricList.

    | Parameter | Description |
    | --- | --- |
    | Namespace | Set this to acs_pai_dlc. |
    | MetricName | Target monitoring metric. For example, CARD_GPU_DRAM_ACTIVE_UTIL. |
    | StartTime | Start time. For example, 2024-05-15 00:00:00. |
    | EndTime | End time. For example, 2024-05-28 00:00:00. |

    Note: The interval between StartTime and EndTime must be 31 days or less.

  4. After you configure the parameters, click Initiate Call to view the monitoring data for the specified time range.
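
When scripting the same query, it can help to validate the parameters before sending them. The following minimal sketch builds the key parameters from the table above and enforces the 31-day limit locally. The helper name is hypothetical, the time format is inferred from the example values, and the sketch does not send the request: calling the API still requires an SDK or a signed HTTP request, which is not shown here.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # format of the example values above (assumption)
MAX_RANGE = timedelta(days=31)     # limit stated in the note above


def build_describe_metric_list_params(metric_name: str, start: str, end: str) -> dict:
    """Validate the time range and return DescribeMetricList request parameters."""
    start_dt = datetime.strptime(start, TIME_FORMAT)
    end_dt = datetime.strptime(end, TIME_FORMAT)
    if end_dt <= start_dt:
        raise ValueError("EndTime must be later than StartTime")
    if end_dt - start_dt > MAX_RANGE:
        raise ValueError("The interval between StartTime and EndTime must be 31 days or less")
    return {
        "Namespace": "acs_pai_dlc",
        "MetricName": metric_name,
        "StartTime": start,
        "EndTime": end,
    }


params = build_describe_metric_list_params(
    "CARD_GPU_DRAM_ACTIVE_UTIL", "2024-05-15 00:00:00", "2024-05-28 00:00:00"
)
```

Checking the range locally surfaces a bad time window before the call is made. Parameters not listed here are left at their defaults, as in the console procedure.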

Use ARMS

Application Real-Time Monitoring Service (ARMS) is an observability platform. Use ARMS to create Grafana dashboards and Prometheus alert rules for DLC distributed training jobs. For more information, see What is Application Real-Time Monitoring Service (ARMS)?.

Billing

ARMS incurs fees. For billing details, see Billing of ARMS.

Integrate monitoring data

To integrate DLC monitoring data into ARMS:

  1. Log on to the ARMS console, and in the left-side navigation pane, click Integration Center.

  2. On the Integration Center page, click the Artificial Intelligence tab, then click Alibaba Cloud PAI-DLC Distributed Training Service.

  3. In the panel that appears, on the Start Provisioning tab, select a Data Storage Region, enter an Integration Name, and then click OK.

    The integration takes about 1 to 2 minutes. Switch to the Effect Preview, Collected Metrics, and Alert Rule Templates tabs to view the metric dashboard, supported metrics, and alert rule templates.

  4. After the integration is complete, go to the Provisioning page to view the integrated environment details.

View Grafana dashboards

  1. Log on to the ARMS console. In the left-side navigation pane, select Provisioning. On the Provisioned Environments > Cloud Service Region Environment tab, click the environment name.

  2. On the Component Management tab, in the Component Type section, select Alibaba Cloud PAI-DLC Distributed Training Service, and then click Dashboards on the right to view the built-in Grafana dashboards.

  3. Click a dashboard name to view the monitoring dashboard.

Configure Prometheus alerts

Configure Prometheus alert rules for DLC training jobs:

  1. Log on to the ARMS console. In the left-side navigation pane, select Provisioning. On the Provisioned Environments > Cloud Service Region Environment tab, click the environment name.

  2. On the Component Management tab, in the Component Type list, select Alibaba Cloud PAI-DLC Distributed Training Service and click Alert Rules to view the built-in alert rules.

  3. The built-in alert rules generate events but do not send notifications. Configure notifications in one of the following ways:

    • Set up a notification policy with matching rules for alert events. When a rule matches, the system sends an alert to the specified recipient. For more information, see Notification policies.

    • Edit the alert rule to configure the notification method. On the Prometheus alert rule editing page, customize alert conditions, duration, content, and notifications. For details, see Create a Prometheus alert rule.