Monitor and configure alerts for PAI training jobs using CloudMonitor or ARMS

Prerequisites

To configure monitoring and alerts for DLC training jobs, you must first create one or more DLC training jobs. For more information, see Create a training job.

Limitations

Monitoring is unavailable for pay-as-you-go training jobs that use general-purpose computing resources.

Accounts and permissions

Alibaba Cloud account: Allows you to perform all operations without additional authorization.
RAM user:
- To view monitoring data for a DLC job in a workspace, the RAM user must meet the following requirements:
  - The RAM user is a workspace member with the Administrator, Algorithm Developer, or Algorithm O&M Engineer role. For more information, see Manage workspace members.
  - The RAM user has read-only access to CloudMonitor (AliyunCloudMonitorReadOnlyAccess). For more information, see Manage RAM user permissions.
- To view monitoring data for a DLC job in a workspace and configure monitoring alerts, the RAM user must meet the following requirements:
  - The RAM user is a workspace member with the Administrator, Algorithm Developer, or Algorithm O&M Engineer role. For more information, see Manage workspace members.
  - The RAM user has administrative access to CloudMonitor (AliyunCloudMonitorFullAccess). For more information, see Manage RAM user permissions.

Monitoring metrics

Monitoring metrics include GPU, CPU, memory, disk, network, RDMA, and CPFS. Supported dimensions are job, pod (worker), and individual GPU card. The following tables list typical health metrics. For a complete list of metrics and their detailed descriptions, see Metrics for Deep Learning Containers (DLC).

Job

Metric	Description
CPU utilization (job dimension)	The percentage of total CPU resources used by the job.
Memory utilization (job dimension)	The percentage of total memory resources used by the job.
Disk read data rate (job dimension)	The rate at which data is read from disk for the job, in MiB/s.
Disk write data rate (job dimension)	The rate at which data is written to disk for the job, in MiB/s.
Network receive data rate (job dimension)	The rate at which the job receives data, in MiB/s.
Network send data rate (job dimension)	The rate at which the job sends data, in MiB/s.
GPU compute utilization (job dimension)	The percentage of total GPU compute resources used by the job.
GPU memory utilization (job dimension)	The percentage of total GPU memory used by the job.
GPU SM utilization (job dimension)	The percentage of total GPU Streaming Multiprocessor (SM) resources used by the job.
GPU power consumption (job dimension)	The GPU power consumption of the job, in watts.
GPU temperature (job dimension)	The GPU temperature of the job, in degrees Celsius.
Overall GPU health (job dimension)	The overall health of GPUs in the job. A value of 100% indicates that all GPUs are healthy. A value less than 100% indicates that one or more GPUs are abnormal.
RDMA receive data rate (job dimension)	The RDMA receive data rate for the job, in MiB/s.
RDMA send data rate (job dimension)	The RDMA send data rate for the job, in MiB/s.
CPFS write data rate (job dimension)	The rate at which data is written to CPFS for the job, in MiB/s.
CPFS read data rate (job dimension)	The rate at which data is read from CPFS for the job, in MiB/s.
NVLink receive data volume (job dimension)	The data volume received over NVLink by the GPUs in the job.
NVLink send data volume (job dimension)	The data volume sent over NVLink by the GPUs in the job.
PCIe receive data volume (job dimension)	The data volume received over PCIe by the GPUs in the job.
PCIe send data volume (job dimension)	The data volume sent over PCIe by the GPUs in the job.
For more metrics, see Metrics for Deep Learning Containers (DLC).

Pod (worker)

Metric	Description
CPU utilization (pod dimension)	The percentage of total CPU resources used by the pod.
Memory utilization (pod dimension)	The percentage of total memory resources used by the pod.
Disk read data rate (pod dimension)	The rate at which data is read from disk for the pod, in MiB/s.
Disk write data rate (pod dimension)	The rate at which data is written to disk for the pod, in MiB/s.
Network receive data rate (pod dimension)	The rate at which the pod receives data, in MiB/s.
Network send data rate (pod dimension)	The rate at which the pod sends data, in MiB/s.
GPU compute utilization (pod dimension)	The percentage of total GPU compute resources used by the pod.
GPU memory utilization (pod dimension)	The percentage of total GPU memory used by the pod.
GPU SM utilization (pod dimension)	The percentage of total GPU Streaming Multiprocessor (SM) resources used by the pod.
GPU power consumption (pod dimension)	The GPU power consumption of the pod, in watts.
GPU temperature (pod dimension)	The GPU temperature of the pod, in degrees Celsius.
Overall GPU health (pod dimension)	The overall health of GPUs in the pod. A value of 100% indicates that all GPUs are healthy. A value less than 100% indicates that one or more GPUs are abnormal.
RDMA receive data rate (pod dimension)	The RDMA receive data rate for the pod, in MiB/s.
RDMA send data rate (pod dimension)	The RDMA send data rate for the pod, in MiB/s.
CPFS read data rate (pod dimension)	The rate at which data is read from CPFS for the pod, in MiB/s.
CPFS write data rate (pod dimension)	The rate at which data is written to CPFS for the pod, in MiB/s.
NVLink receive data volume (pod dimension)	The data volume received over NVLink by the GPUs in the pod.
NVLink send data volume (pod dimension)	The data volume sent over NVLink by the GPUs in the pod.
PCIe receive data volume (pod dimension)	The data volume received over PCIe by the GPUs in the pod.
PCIe send data volume (pod dimension)	The data volume sent over PCIe by the GPUs in the pod.
For more metrics, see Metrics for Deep Learning Containers (DLC).

GPU card

Metric	Description
GPU memory interface utilization (card dimension)	The percentage of GPU memory interface capacity used on an individual GPU card.
GPU SM utilization (card dimension)	The percentage of GPU SM capacity used on an individual GPU card.
GPU power consumption (card dimension)	The power consumption of an individual GPU card, in watts.
GPU temperature (card dimension)	The temperature of an individual GPU card, in degrees Celsius.
Overall GPU health (card dimension)	The overall health of an individual GPU card. A value of 100% indicates that the card is healthy. A value less than 100% indicates that the card is abnormal.
For more metrics, see Metrics for Deep Learning Containers (DLC).

Monitoring charts

On the DLC job details page, switch to the Monitoring tab to view the job's monitoring data. (Note: Job monitoring data is retained for up to 30 days.)

The Monitoring tab has three subtabs: job dimension, instance dimension, and GPU dimension. Each subtab displays metrics for GPU, CPU, memory, network, disk, and OSS.
Click More to select the metrics to display. You can then drag the metrics to reorder them, which helps prioritize key data for comparison.

The dialog box has two sections: Metric Selection and Metric Sorting. Under the GPU group, available metrics include GPU utilization, GPU memory utilization, total GPU memory, and used GPU memory. Under the CPU group, you can select metrics such as CPU utilization. Click OK after making your selections.
On a monitoring chart, you can use region zoom (zoom in), undo zoom (revert the previous zoom), reset (restore the initial view), and download.
Chart sync: When enabled, this feature synchronizes zoom actions across all charts, making it easier to compare multiple views.

To customize the number of charts displayed per row, click the layout drop-down list on the right and select One per row, Two per row, or Three per row.

Use CloudMonitor

CloudMonitor monitors Alibaba Cloud resources and internet applications. You can use the CloudMonitor console to view monitoring data for PAI-Deep Learning Containers (DLC) jobs and configure alert notifications. CloudMonitor also provides APIs that allow you to subscribe to metric data to build your own monitoring systems and dashboards. For more information, see What is CloudMonitor?.

Billing

Using the CloudMonitor service incurs fees. For detailed billing information, see Billing of CloudMonitor.

View monitoring data

Log on to the CloudMonitor console.
In the left-side navigation pane, choose Visualization > Cloud Service Monitoring Dashboard.
On the cloud service monitoring dashboard page, select PAI-Deep Learning Containers (DLC), and then in the search box, select or search for a workspace ID to view the corresponding monitoring charts. To find your workspace ID, see Manage workspaces.

The monitoring charts area displays three GPU metric panels on the job dimension tab: GPU Memory Interface Utilization (Job Dimension) (%), GPU Compute Utilization (Job Dimension) (%), and GPU SM Utilization (Job Dimension) (%). The reporting period for these metrics is 10 seconds.

On the monitoring charts, you can:
- Switch monitoring dimensions: Display metrics by job, pod (worker), or GPU dimension.
  - Click the job dimension tab. Select or enter a DLC job ID to view the monitoring data for a single job.
  - Click the pod dimension tab. Select or enter a pod ID to view the monitoring data for a single pod.
  - Click the GPU Level tab. Select or enter a pod ID to view GPU-dimension monitoring data for a single pod in a specified DLC job.
- Change the time range: You can select 1 hour, 3 hours, 6 hours, 12 hours, 1 day, 3 days, 7 days, 14 days, or a Custom time period.
- Zoom in: Click the zoom-in icon in the upper-right corner of each chart to view detailed monitoring data.

Configure alerts

You can configure alert rules to monitor the resource levels of PAI-Deep Learning Containers (DLC) jobs. If a resource metric violates a rule, the system sends an alert notification. This section describes how to configure alerts by using the CloudMonitor console and APIs.

Configure alert contacts

Log on to the CloudMonitor console.
In the left-side navigation pane, choose Alerts > Alert Contacts.
On the Alert Contacts tab, click Create Contact, enter the name, phone number, email address, or webhook URL for the contact, and click Confirm.
On the Alert Contact Group tab, click Create Contact Group, enter a name for the group, add existing alert contacts to the group, and then click Confirm.

Configure alert rules

In the left-side navigation pane of the CloudMonitor console, choose Cloud Resource Monitoring > Cloud Service Monitoring.
On the Cloud Service Monitoring page, enter PAI-Deep Learning Containers (DLC) in the search box. In the search results, under the Metric Monitoring category, click PAI-Deep Learning Containers (DLC).
On the PAI-Deep Learning Containers (DLC) page, select your service region and click Create Alert Rule.

In the Create Alert Rule panel, configure the following parameters and click Confirm.

Parameter	Description
Product	The product to monitor. Select PAI-Deep Learning Containers (DLC).
Resource Scope	The scope of the alert rule. Valid values: All Resources and instance. If you select All Resources, an alert notification is sent when any DLC resource meets the alert rule. If you select instance, add the workspaces that you want to associate in the Associate Resources section. An alert notification is then sent only when a DLC job in those workspaces meets the alert rule.
Rule Description	The conditions that trigger the alert. An alert is triggered when monitoring data meets these conditions. For more information about how to set a rule description, see Create an alert rule.
Mute For	The interval at which the alert notification is resent while the alert remains unresolved.
Effective Period	The period when the alert rule is active. CloudMonitor only checks for alerts during this period.
Tags	Custom tags for the alert rule, specified as key-value pairs.
Alert Contact Group	The contact group that receives alert notifications. Select a group with configured alert contacts.
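
The interaction between Rule Description, Mute For, and Effective Period can be easier to see in a small simulation. The following Python sketch is a simplified local model, not CloudMonitor's actual evaluation logic. The `AlertRule` class, its fields, and the `notifications` function are hypothetical names introduced for illustration, and the sketch assumes a rule of the form "value exceeds a threshold for N consecutive data points".

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class AlertRule:
    """Hypothetical, simplified stand-in for an alert rule (not a CloudMonitor type)."""
    threshold: float        # Rule Description: alert when value > threshold ...
    consecutive: int        # ... for this many consecutive data points
    mute_for: timedelta     # Mute For: minimum gap between repeated notifications
    effective_start: time   # Effective Period start (inclusive)
    effective_end: time     # Effective Period end (inclusive)


def notifications(rule, points):
    """Return the timestamps at which a notification would be sent.

    points: iterable of (datetime, float) pairs sorted by time.
    """
    sent = []
    streak = 0
    last_sent = None
    for ts, value in points:
        # Outside the effective period the rule is not checked at all.
        if not (rule.effective_start <= ts.time() <= rule.effective_end):
            streak = 0
            continue
        if value > rule.threshold:
            streak += 1
        else:
            streak = 0
            last_sent = None  # alert resolved; the next violation notifies again
        if streak < rule.consecutive:
            continue
        # Still unresolved: resend only after the mute interval has elapsed.
        if last_sent is None or ts - last_sent >= rule.mute_for:
            sent.append(ts)
            last_sent = ts
    return sent


# Usage: one data point per minute, all above the threshold.
rule = AlertRule(90.0, 3, timedelta(minutes=10), time(0, 0), time(23, 59))
start = datetime(2024, 5, 15, 10, 0)
points = [(start + timedelta(minutes=i), 95.0) for i in range(15)]
for ts in notifications(rule, points):
    print(ts.isoformat())
```

In this model a longer Mute For value reduces repeated notifications for the same unresolved condition, while a narrower Effective Period suppresses checks entirely outside the chosen hours.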

On the PAI-Deep Learning Containers (DLC) page, click View Alert Rules to view the details and history of your alert rules. You can also modify the rules.

You can also configure the alert service by calling API operations. These operations allow you to view alert history, manage alert templates, configure alert rules, and manage alert contacts. For more information, see CloudMonitor API Reference: alert service.

Subscribe to monitoring metrics

CloudMonitor provides a comprehensive set of API operations that you can use to subscribe to DLC monitoring metrics and data. This allows you to build your own monitoring systems and dashboards. For detailed steps, see Cloud Service Monitoring API Directory.

API	Description
DescribeMetricLast	Queries the latest monitoring data of a specified metric.
DescribeMetricList	Queries the monitoring data of a specified metric for a specified cloud service.
DescribeMetricData	Queries the monitoring data of a metric for a specified cloud service.
DescribeMetricMetaList	Lists available metrics and their metadata.
DescribeProjectMeta	Lists cloud services that provide time-series metrics.
DescribeMetricTop	Queries the latest monitoring data of a specified metric for a specified cloud service and returns the data in sorted order.

The following example shows how to call the DescribeMetricList API operation to query the monitoring data of a specified metric for PAI-Deep Learning Containers (DLC).

Go to the Metrics for PAI-Deep Learning Containers (DLC) page.
On the metrics list page, find the target metric, such as JOB_GPU_ACCELERATOR_DUTTY_UTIL, and click Get Metric Data in the Actions column to go to the OpenAPI Portal page.

On the OpenAPI Portal page, configure the following key parameters and leave the others at their default values. For more information about the parameters, see DescribeMetricList.

Parameter	Description
Namespace	Set this parameter to `acs_pai_dlc`.
MetricName	The metric to query. Example: `CARD_GPU_DRAM_ACTIVE_UTIL`.
StartTime	The start time. Example: `2024-05-15 00:00:00`.
EndTime	The end time. Example: `2024-05-28 00:00:00`. Note: The interval between `StartTime` and `EndTime` must be 31 days or less.
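
If you assemble these parameters in a script instead of the OpenAPI Portal, it can help to check the 31-day limit before sending the request. The following Python sketch only builds a plain dictionary from the four parameters in the table. The helper name `build_metric_list_params` is hypothetical, and how the dictionary is passed to a CloudMonitor SDK or HTTP client is left out.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"   # matches the StartTime/EndTime examples above
MAX_INTERVAL = timedelta(days=31)   # limit stated in the EndTime note


def build_metric_list_params(metric_name, start, end, namespace="acs_pai_dlc"):
    """Return DescribeMetricList request parameters as a plain dict.

    Raises ValueError if the time range is empty or longer than 31 days.
    """
    if end <= start:
        raise ValueError("EndTime must be later than StartTime")
    if end - start > MAX_INTERVAL:
        raise ValueError("StartTime and EndTime must be at most 31 days apart")
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "StartTime": start.strftime(TIME_FORMAT),
        "EndTime": end.strftime(TIME_FORMAT),
    }


# Usage with the example values from the table.
params = build_metric_list_params(
    "CARD_GPU_DRAM_ACTIVE_UTIL",
    datetime(2024, 5, 15),
    datetime(2024, 5, 28),
)
print(params)
```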

After you configure the parameters, click Initiate Call to view the monitoring data for the specified time range. A successful call returns a 200 status code. The Datapoints array in the response body contains data fields such as timestamp, jobId, regionId, userId, workspaceId, and Value.
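
Once a response is available, the data points usually need to be grouped before they can be charted. The following Python sketch groups values by `jobId` using the field names listed above. The sample payload is synthetic and for illustration only, the IDs in it are made up, and the sketch assumes that `Datapoints` may arrive either as a JSON-encoded string or as an already decoded list.

```python
import json
from collections import defaultdict


def datapoints_by_job(response):
    """Group (timestamp, Value) pairs by jobId, sorted by timestamp."""
    raw = response.get("Datapoints", [])
    # Assumption: Datapoints may be a JSON-encoded string or a decoded list.
    points = json.loads(raw) if isinstance(raw, str) else raw
    grouped = defaultdict(list)
    for point in points:
        grouped[point["jobId"]].append((point["timestamp"], point["Value"]))
    for series in grouped.values():
        series.sort()
    return dict(grouped)


# Synthetic sample; not real monitoring data. All IDs are hypothetical.
sample_response = {
    "Datapoints": json.dumps([
        {"timestamp": 1715731270000, "jobId": "dlc-example-a",
         "regionId": "cn-hangzhou", "userId": "1000000000000000",
         "workspaceId": "12345", "Value": 62.5},
        {"timestamp": 1715731260000, "jobId": "dlc-example-a",
         "regionId": "cn-hangzhou", "userId": "1000000000000000",
         "workspaceId": "12345", "Value": 60.0},
        {"timestamp": 1715731260000, "jobId": "dlc-example-b",
         "regionId": "cn-hangzhou", "userId": "1000000000000000",
         "workspaceId": "12345", "Value": 41.0},
    ])
}

series_by_job = datapoints_by_job(sample_response)
for job_id, series in series_by_job.items():
    print(job_id, series)
```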

Use ARMS

Application Real-Time Monitoring Service (ARMS) is a cloud-native observability platform from Alibaba Cloud. With ARMS, you can create custom Grafana dashboards and configure flexible alert rules using Prometheus to comprehensively monitor your DLC job metrics. For more information, see What is Application Real-Time Monitoring Service (ARMS)?.

Billing

Using ARMS incurs fees. For billing details, see ARMS billing.

Integrate monitoring data

To integrate DLC monitoring data into ARMS:

Log on to the ARMS console, and in the left-side navigation pane, click Integration Center.
On the Integration Center page, click the Artificial Intelligence tab, and then click PAI-DLC Distributed Training Service.
In the panel that appears, on the Start Provisioning tab, select a Data Storage Region, enter an Integration Name, and then click OK.

The integration takes about 1 to 2 minutes. You can also switch to the Effect Preview, Collected Metrics, and Alert Rule Templates tabs to view the metric dashboard, supported metrics, and alert rule names and template details.
After the integration is complete, you can click Provisioning to view the details of the provisioned environment.

Grafana dashboard

Log on to the ARMS console. In the left-side navigation pane, select Provisioning. On the Provisioned Environments > Cloud Service Region Environment tab, click the environment name.
On the Component Management tab, in the Component Type section, select PAI-DLC Distributed Training Service, and then click Dashboards on the right to view the built-in Grafana dashboards.
Click a dashboard name to view the monitoring dashboard. The PAI-DLC Distributed Training Service - Instance Details dashboard provides filters for workspaceId, jobId, pod, and gpu. It organizes metrics into Job Dimension, Card Dimension, and Pod Dimension panels. The Pod Dimension panel shows a pod details table, including CPU utilization, disk I/O rates, and memory usage, along with time-series charts for CPFS read latency, CPFS write data volume, and CPFS read data volume.

Configure Prometheus alerts

You can configure monitoring alerts using Prometheus as follows:

Log on to the ARMS console. In the left-side navigation pane, select Provisioning. On the Provisioned Environments > Cloud Service Region Environment tab, click the environment name.
On the Component Management tab, in the Component Type list, select PAI-DLC Distributed Training Service and click Alert Rules to view the built-in alert rules. By default, these rules are in the Stopped state.
The built-in alert rules generate events but do not send notifications. You can configure notifications to be sent to an email address or other platforms in one of the following two ways:
- Set up a notification policy. This policy defines matching rules for alert events. When an alert event matches a rule, the system sends a notification to the specified recipient through your chosen notification method. For more information, see Notification Policies.
- Edit the alert rule to configure the notification method:
  - On the alert rule management page, select the target component type from the left-side navigation pane, such as PAI-DLC Distributed Training Service, PAI-DSW, PAI-Quota Service, or PAI-Quota (non-Lingjun). Find the target rule in the list and click Edit.
  - On the Prometheus alert rule editing page, you can customize the alert conditions, duration, content, and notifications. For more information, see Create a Prometheus alert rule.
  - When editing the rule, set Check Type to Custom PromQL. In the Custom PromQL Statement field, enter an expression, such as `AliyunPaidlc_POD_STATE_ACTIVE{} > 80`. Set the Duration to 2 minutes and the Alert Level to P2.
  - In the Alert Notification section, select Simple Mode and configure the Recipient, Notification Period (from 00:00 to 23:59), and Repeat Policy.
  - After you complete the configuration, click Done.
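
The Duration setting in the example above means that a single sample above the threshold is not enough to raise the alert. The following Python sketch models this with the usual Prometheus-style "condition must hold continuously" semantics. It is a local approximation under that assumption, not ARMS code, and the function name `fires` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def fires(samples, threshold=80.0, duration=timedelta(minutes=2)):
    """Return True if value > threshold held continuously for at least `duration`.

    samples: list of (datetime, float) pairs sorted by time.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= duration:
                return True
        else:
            pending_since = None    # any dip resets the pending window
    return False


# Usage: one sample every 30 seconds.
t0 = datetime(2024, 5, 15, 10, 0, 0)
steady = [(t0 + timedelta(seconds=30 * i), 85.0) for i in range(5)]
dipped = [(t0 + timedelta(seconds=30 * i), v)
          for i, v in enumerate([85.0, 85.0, 70.0, 85.0, 85.0])]
print(fires(steady), fires(dipped))
```

A brief dip below the threshold restarts the waiting period, which is why a longer Duration tends to filter out short spikes at the cost of slower alerts.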