You can monitor a resource quota's status and load with a wide range of metrics. You can also configure flexible alert rules and notifications to track resource usage in real time. When a metric, such as CPU utilization, crosses a specified threshold, the system sends an alert notification. This article describes how to use CloudMonitor and ARMS to view monitoring data, configure alert notifications, and subscribe to metrics.
Prerequisites
To monitor a resource quota or create alerts for it, you must first create one. For more information, see Introduction to resource quotas.
Metrics
PAI-Quota provides key performance metrics for GPU, CPU, memory, disk, and network resources. You can view these metrics by quota or by node. For a complete list and detailed descriptions of all metrics, see PAI-Quota metrics.
By quota
Metric | Description |
GPU compute utilization (by quota) | The GPU compute utilization for the specified resource quota. |
GPU memory utilization (by quota) | The GPU memory utilization for the specified resource quota. |
Scheduled GPUs (by quota) | The number of scheduled GPUs for the specified resource quota. |
Total GPUs (by quota) | The total number of GPUs for the specified resource quota. |
GPU power consumption (by quota) | The GPU power consumption for the specified resource quota. |
Scheduled CPU cores (by quota) | The number of scheduled CPU cores for the specified resource quota. |
Total CPU cores (by quota) | The total number of CPU cores for the specified resource quota. |
CPU utilization (by quota) | The CPU utilization for the specified resource quota. |
Memory usage (by quota) | The memory usage for the specified resource quota. |
For more metrics, see PAI-Quota metrics.
By node
Metric | Description |
GPU compute utilization (by node) | The GPU compute utilization for the specified node. |
GPU memory utilization (by node) | The GPU memory utilization for the specified node. |
Scheduled GPUs (by node) | The number of scheduled GPUs for the specified node. |
Total GPUs (by node) | The total number of GPUs for the specified node. |
GPU power consumption (by node) | The GPU power consumption for the specified node. |
Scheduled CPU cores (by node) | The number of scheduled CPU cores for the specified node. |
Total CPU cores (by node) | The total number of CPU cores for the specified node. |
CPU utilization (by node) | The CPU utilization for the specified node. |
Memory usage (by node) | The memory usage for the specified node. |
For more metrics, see PAI-Quota metrics.
View monitoring dashboards
Log on to the PAI console. On the resource quota's details page, click the Monitoring tab to view its monitoring information.

The monitoring page displays metrics by quota and by node, covering GPU, CPU, memory, network, and disk usage. (Note: Monitoring data is retained for 30 days.)
Click More to select key metrics based on your business needs. You can drag and drop metrics to reorder them, allowing you to focus on core data and create personalized comparisons.
The monitoring charts allow you to zoom in on a selected area, undo the last zoom action, reset the view to its initial state, and download chart data.

Chart Synchronization: When enabled, zooming is synchronized across all charts, making it easy to compare multiple views.

You can customize the number of charts displayed per row.
Use CloudMonitor
CloudMonitor is a service that monitors Alibaba Cloud resources and internet applications. It provides a one-stop, out-of-the-box monitoring solution for enterprises. You can log on to the CloudMonitor console to view PAI-Quota monitoring data and set up alert notifications. CloudMonitor also provides APIs that let you subscribe to metric data and build your own monitoring systems and dashboards. For more information, see What is CloudMonitor?
Billing
Using CloudMonitor may incur fees. For detailed billing information, see CloudMonitor billing.
View monitoring data
Log on to the CloudMonitor console.
In the left-side navigation pane, choose .
On the Cloud Service Monitoring page, select PAI-Quota. In the search box, select or enter a resource quota name. The corresponding monitoring charts are displayed below.
You can perform the following operations on the monitoring charts:
Switch monitoring dimensions: You can view metrics by quota and by node.

Change the time range:

Zoom in: Click the zoom icon in the upper-right corner of a chart to view its monitoring data in detail.
Configure alert rules
Use the alerting feature to monitor resource usage within your resource quotas and configure flexible alert rules. If resource usage fluctuates and crosses a configured threshold, the system sends an alert notification. Follow these steps to configure alert notifications in the CloudMonitor console:
Step 1: Configure an alert contact
Log on to the CloudMonitor console.
In the left-side navigation pane, choose .
On the Alert Contacts tab, click Create Alert Contact, enter the contact's name, phone number, email address, or webhook URL, and click OK.
On the Alert Contact Group tab, click Create Alert Contact Group, enter a group name and add existing alert contacts to the group, and then click OK.
Step 2: Configure an alert rule
In the left-side navigation pane of the CloudMonitor console, choose Cloud Service Monitoring.
On the Cloud Service Monitoring page, search for and go to PAI-Quota.

On the PAI-Quota page, select the region where your service is located, and then click Create Alert Rule.
In the Create Alert Rule panel, configure the following parameters and click OK.
Parameter | Description |
Product | The name of the service managed by CloudMonitor. Select PAI-Quota. |
Resource scope | The scope of the alert rule. Options are All Resources, Application Group, and Instances. All Resources: An alert is sent if any resource meets the rule's condition. Instances: Select the specific resource quotas (Associated Resources) to which the rule applies. An alert is triggered only when one or more of the selected instances meet the alert condition. |
Rule description | The condition that triggers the alert. An alert is sent when monitoring data meets this condition. For information about how to configure an alert rule, see Create an alert rule. |
Mute for | The interval for resending notifications for an unresolved alert. |
Effective period | The time period during which the alert rule is active. The system checks monitoring data for alert conditions only during this period. |
Alert contact group | The contact group that receives alert notifications. Select a group that has alert contacts assigned to it. |
Tag | A key-value pair used to tag the alert rule. |
On the PAI-Quota page, click View Alert Rules to see the details of created alert rules, view alert history, and modify rules.
You can also use API operations to configure alert notifications. These operations let you view alert history, manage alert templates, and configure alert rules and contacts. For more information, see CloudMonitor API reference: Alerts.
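The interaction between the rule condition, the Mute for interval, and the Effective period can be easier to follow with a small simulation. The following Python sketch is an illustrative model only, not CloudMonitor's implementation. The function name `notification_times`, the "greater than or equal to" comparison, and the hour-based effective period are assumptions made for the example.

```python
from datetime import datetime, timedelta


def notification_times(samples, threshold, mute_for, effective_hours=(0, 24)):
    """Return the timestamps at which a notification would be sent.

    samples:         iterable of (datetime, value) pairs, ordered by time
    threshold:       the alert condition is met when value >= threshold (assumed)
    mute_for:        timedelta between repeated notifications for an unresolved alert
    effective_hours: (start_hour, end_hour); samples outside this window are not checked
    """
    start_hour, end_hour = effective_hours
    sent = []
    last_sent = None
    for ts, value in samples:
        if not (start_hour <= ts.hour < end_hour):
            continue  # outside the effective period: the sample is ignored
        if value < threshold:
            last_sent = None  # condition cleared: the next breach notifies immediately
            continue
        if last_sent is None or ts - last_sent >= mute_for:
            sent.append(ts)
            last_sent = ts
    return sent


# One sample every 10 minutes, starting at 09:00 (made-up utilization values).
start = datetime(2024, 5, 15, 9, 0)
values = [50, 85, 90, 95, 60, 88]
samples = [(start + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]

for ts in notification_times(samples, threshold=80, mute_for=timedelta(minutes=30)):
    print(ts.strftime("%H:%M"))
```

With these samples, a 30-minute mute interval yields notifications at 09:10 and 09:50: the breaches at 09:20 and 09:30 fall inside the mute interval, and the drop to 60 at 09:40 resolves the alert, so the breach at 09:50 notifies again.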
Subscribe to metrics
CloudMonitor provides a comprehensive API service that lets you subscribe to resource quota metrics. You can use this service to build your own monitoring systems and dashboards. For more information, see Cloud Service Monitoring API reference.
CloudMonitor API | Overview |
 | Queries the latest monitoring data of a metric. |
 | Queries monitoring data of a metric for a cloud service. |
 | Queries monitoring data of a metric for a cloud service. |
 | Queries details of metrics available in CloudMonitor. |
 | Queries cloud services that support time series metrics in CloudMonitor. |
 | Queries the latest monitoring data of a metric for a cloud service, sorted by value. |
The following example shows how to use the DescribeMetricList API operation to query metric data.
Go to the PAI-Quota metrics page.
In the row of the target metric, choose Actions > Get Metric Data.

In OpenAPI Explorer, configure the following key parameters and leave the others at their default settings. For more information about the parameters, see DescribeMetricList.
Parameter | Description |
Namespace | Set this to acs_pai_quota. |
MetricName | The name of the metric to query. For example, QUOTA_CPU_REQUEST. |
StartTime | The start of the time range for the query. For example, 2024-05-15 00:00:00. |
EndTime | The end of the time range for the query. For example, 2024-05-28 00:00:00. |
Note: The time range between StartTime and EndTime cannot exceed 31 days.
After you configure the parameters, click Initiate Call to view the metric data for the specified time range.
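If you script this query instead of using OpenAPI Explorer, it can help to validate the parameters before sending the request. The following Python sketch only assembles and checks the four parameters described above. The helper name `build_metric_list_params` is hypothetical, and the sketch deliberately stops short of making the API call, because the client setup depends on the SDK and credentials you use.

```python
import json
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"   # format used by the examples in this article
MAX_RANGE = timedelta(days=31)      # limit on the StartTime-to-EndTime range


def build_metric_list_params(metric_name, start_time, end_time,
                             namespace="acs_pai_quota"):
    """Build the key DescribeMetricList parameters and check the time range."""
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    if end <= start:
        raise ValueError("EndTime must be later than StartTime")
    if end - start > MAX_RANGE:
        raise ValueError("The range between StartTime and EndTime cannot exceed 31 days")
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "StartTime": start_time,
        "EndTime": end_time,
    }


params = build_metric_list_params(
    "QUOTA_CPU_REQUEST", "2024-05-15 00:00:00", "2024-05-28 00:00:00"
)
print(json.dumps(params, indent=2))
```

The resulting dictionary can then be passed to whichever client you use to call DescribeMetricList. A range longer than 31 days raises a `ValueError` before any request is made.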
Use ARMS
Application Real-Time Monitoring Service (ARMS) is an Alibaba Cloud-native observability platform. With ARMS, you can create a custom Grafana dashboard for PAI-Quota and configure Prometheus alert rules to monitor its metric data. For more information, see What is ARMS?
Billing
Fees may be incurred when you use ARMS. For detailed billing information, see ARMS billing.
Integrate monitoring data
Follow these steps:
Log on to the ARMS console.
In the left-side navigation pane, click Integration Center.
On the Integration Center page, click the Artificial Intelligence tab on the left, and then click Alibaba Cloud PAI-Quota Service.

(Optional) In the panel that appears, you can preview the monitoring dashboard and review the collected metrics and alert rule templates.
Preview
Click the Preview tab to view the metrics dashboard.

Collect metrics
Click the Collect Metrics tab to view the list of collected metrics.

Alert rule template
Click the Alert Rule Template tab to view the predefined alert rule templates.

On the Start integration tab, configure the following parameters and click OK.
Parameter | Description |
Select region for data storage | Select the region where you want to store the data. |
Integration name | Configure an integration name for the service as prompted on the console. |
Integrating the PAI-Quota monitoring data takes about 1 to 2 minutes.
After the integration is complete, you can click Integration Management to view details of the integrated environment.
View the Grafana dashboard
Log on to the ARMS console. In the left-side navigation pane, choose Integration Management. On the Integrated Environments > Cloud Service Environment tab, click the name of the environment.
On the Component Management tab, click Dashboard in the Addon Type section to view the built-in Grafana dashboard.

Click the dashboard name to view the monitoring dashboard.

Configure Prometheus alerts
You can configure monitoring alerts by using Prometheus. Follow these steps:
Log on to the ARMS console. In the left-side navigation pane, choose Integration Management. On the Integrated Environments > Cloud Service Environment tab, click the name of the environment.
On the Component Management page, click Alert Rule in the Addon Type section to view the built-in alert rules.

The built-in alert rules generate alert events but do not send notifications. You can configure notifications to be sent to email or other platforms by using one of the following two methods:
Configure a notification policy to define matching rules for alert events. When a rule is triggered, the system sends an alert to the specified contact using your chosen method. For more information, see Notification policies.
Edit an alert rule to configure its notification method.

On the page for editing a Prometheus alert rule, you can also customize alert conditions, duration, content, and notifications. For detailed configuration information, see Create a Prometheus alert rule.
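To show how the alert condition, duration, and content fit together, the following Python sketch assembles a rule using the field names of the open-source Prometheus alerting rule format (`alert`, `expr`, `for`, `labels`, `annotations`). It is a sketch under stated assumptions: the metric name `pai_quota_gpu_utilization` and the label `quota_id` are hypothetical placeholders, and the ARMS console form may present these settings differently. Use the metric names shown on the Collect Metrics tab for real rules.

```python
import json


def build_alert_rule(name, expr, duration, summary, severity="warning"):
    """Assemble one rule in the open-source Prometheus alerting rule layout."""
    return {
        "alert": name,                        # rule name
        "expr": expr,                         # alert condition (PromQL expression)
        "for": duration,                      # how long the condition must hold
        "labels": {"severity": severity},     # labels attached to the alert event
        "annotations": {"summary": summary},  # alert content
    }


rule = build_alert_rule(
    name="QuotaGpuUtilizationHigh",
    # Hypothetical metric name, used only for illustration.
    expr="pai_quota_gpu_utilization > 90",
    duration="5m",
    # The quota_id label is also a hypothetical placeholder.
    summary="GPU utilization of quota {{ $labels.quota_id }} has stayed above 90% for 5 minutes",
)

print(json.dumps({"groups": [{"name": "pai-quota-example", "rules": [rule]}]}, indent=2))
```

Notifications are not part of the rule itself in this layout, which matches the behavior described above: the rule generates alert events, and a notification policy or the rule's notification settings decide who is contacted.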