Platform For AI: Resource quota monitoring and alerting

Last Updated: May 07, 2026

You can monitor a resource quota's status and load with a wide range of metrics. You can also configure flexible alert rules and notifications to track resource usage in real time. When a metric, such as CPU utilization, crosses a specified threshold, the system sends an alert notification. This article describes how to use CloudMonitor and ARMS to view monitoring data, configure alert notifications, and subscribe to metrics.

Prerequisites

To monitor a resource quota or create alerts for it, you must first create one. For more information, see Introduction to resource quotas.

Metrics

PAI-Quota provides key performance metrics for GPU, CPU, memory, disk, and network resources. You can view these metrics by quota or by node. For a complete list and detailed descriptions of all metrics, see PAI-Quota metrics.

By quota

| Metric | Description |
| --- | --- |
| GPU compute utilization (by quota) | The GPU compute utilization for the specified resource quota. |
| GPU memory utilization (by quota) | The GPU memory utilization for the specified resource quota. |
| Scheduled GPUs (by quota) | The number of scheduled GPUs for the specified resource quota. |
| Total GPUs (by quota) | The total number of GPUs for the specified resource quota. |
| GPU power consumption (by quota) | The GPU power consumption for the specified resource quota. |
| Scheduled CPU cores (by quota) | The number of scheduled CPU cores for the specified resource quota. |
| Total CPU cores (by quota) | The total number of CPU cores for the specified resource quota. |
| CPU utilization (by quota) | The CPU utilization for the specified resource quota. |
| Memory usage (by quota) | The memory usage for the specified resource quota. |

For more metrics, see PAI-Quota metrics.

By node

| Metric | Description |
| --- | --- |
| GPU compute utilization (by node) | The GPU compute utilization for the specified node. |
| GPU memory utilization (by node) | The GPU memory utilization for the specified node. |
| Scheduled GPUs (by node) | The number of scheduled GPUs for the specified node. |
| Total GPUs (by node) | The total number of GPUs for the specified node. |
| GPU power consumption (by node) | The GPU power consumption for the specified node. |
| Scheduled CPU cores (by node) | The number of scheduled CPU cores for the specified node. |
| Total CPU cores (by node) | The total number of CPU cores for the specified node. |
| CPU utilization (by node) | The CPU utilization for the specified node. |
| Memory usage (by node) | The memory usage for the specified node. |

For more metrics, see PAI-Quota metrics.

View monitoring dashboards

Log on to the PAI console. On the resource quota's details page, click the Monitoring tab to view its monitoring information.

  1. The monitoring page displays metrics by quota and by node, covering GPU, CPU, memory, network, and disk usage. (Note: Monitoring data is retained for 30 days.)

  2. Click More to select key metrics based on your business needs. You can drag and drop metrics to reorder them, allowing you to focus on core data and create personalized comparisons.

  3. The monitoring charts allow you to zoom in on a selected area, undo the last zoom action, reset the view to its initial state, and download chart data.

  4. Chart Synchronization: When enabled, zooming is synchronized across all charts, making it easy to compare multiple views.

  5. You can customize the number of charts displayed per row.

Use CloudMonitor

CloudMonitor is a service that monitors Alibaba Cloud resources and internet applications. It provides a one-stop, out-of-the-box monitoring solution for enterprises. You can log on to the CloudMonitor console to view PAI-Quota monitoring data and set up alert notifications. CloudMonitor also provides APIs that let you subscribe to metric data and build your own monitoring systems and dashboards. For more information, see What is CloudMonitor?.

Billing

Using CloudMonitor may incur fees. For detailed billing information, see CloudMonitor billing.

View monitoring data

  1. Log on to the CloudMonitor console.

  2. In the left-side navigation pane, choose Visualization > Cloud Service Monitoring Dashboard.

  3. On the Cloud Service Monitoring page, select PAI-Quota. In the search box, select or enter a resource quota name. The corresponding monitoring charts are displayed below.

    You can perform the following operations on the monitoring charts:

    • Switch monitoring dimensions: You can view metrics by quota and by node.

    • Change the time range.

    • Zoom in: Click the zoom icon in the upper-right corner of a chart to see its monitoring data in detail.

Configure alert rules

Use the alerting feature to monitor resource usage within your resource quotas and configure flexible alert rules. If resource usage fluctuates and crosses a configured threshold, the system sends an alert notification. Follow these steps to configure alert notifications in the CloudMonitor console:

Step 1: Configure an alert contact

  1. Log on to the CloudMonitor console.

  2. In the left-side navigation pane, choose Alerts > Alert Contacts.

  3. On the Alert Contacts tab, click Create Alert Contact, enter the contact's name, phone number, email address, or webhook URL, and click OK.

  4. On the Alert Contact Group tab, click Create Alert Contact Group, enter a group name and add existing alert contacts to the group, and then click OK.

Step 2: Configure an alert rule

  1. In the left-side navigation pane of the CloudMonitor console, choose Cloud Service Monitoring.

  2. On the Cloud Service Monitoring page, search for and go to PAI-Quota.

  3. On the PAI-Quota page, select the region where your service is located, and then click Create Alert Rule.

  4. In the Create Alert Rule panel, configure the following parameters and click OK.

    | Parameter | Description |
    | --- | --- |
    | Product | The name of the service managed by CloudMonitor. Select PAI-Quota. |
    | Resource scope | The scope of the alert rule. Options are All Resources, Application Group, and Instances. All Resources: an alert is sent if any resource meets the rule's condition. Instances: select the specific resource quotas (Associated Resources) to which the rule applies; an alert is triggered only when one or more of the selected instances meet the alert condition. |
    | Rule description | The condition that triggers the alert. An alert is sent when monitoring data meets this condition. For information about how to configure an alert rule, see Create an alert rule. |
    | Mute for | The interval for resending notifications for an unresolved alert. |
    | Effective period | The time period during which the alert rule is active. The system checks monitoring data for alert conditions only during this period. |
    | Alert contact group | The contact group that receives alert notifications. Select a group that has alert contacts assigned to it. |
    | Tag | A key-value pair used to tag the alert rule. |

  5. On the PAI-Quota page, click View Alert Rules to see the details of created alert rules, view alert history, and modify rules.

You can also use API operations to configure alert notifications. These operations let you view alert history, manage alert templates, and configure alert rules and contacts. For more information, see CloudMonitor API reference: Alerts.
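The way the threshold, Mute for, and Effective period settings interact can be illustrated with a small simulation. This is a simplified sketch, not CloudMonitor's actual evaluation logic: the `AlertRule` class, the `notification_times` function, and the sample data are hypothetical, and the sketch assumes that a resolved alert notifies again immediately on the next breach.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class AlertRule:
    """Hypothetical model of the alert rule parameters described above."""
    threshold: float        # alert condition: value above this threshold
    mute_for: timedelta     # interval before an unresolved alert is resent
    effective_start: time   # start of the daily effective period
    effective_end: time     # end of the daily effective period


def notification_times(rule, samples):
    """Return the timestamps at which a notification would be sent."""
    sent = []
    last_sent = None
    for ts, value in samples:
        # Data outside the effective period is not checked at all.
        if not (rule.effective_start <= ts.time() <= rule.effective_end):
            continue
        if value <= rule.threshold:
            # Assumption: once resolved, the next breach notifies immediately.
            last_sent = None
            continue
        # Notify on the first breach, then again only after the mute interval.
        if last_sent is None or ts - last_sent >= rule.mute_for:
            sent.append(ts)
            last_sent = ts
    return sent


rule = AlertRule(threshold=80.0, mute_for=timedelta(hours=1),
                 effective_start=time(9, 0), effective_end=time(18, 0))
day = datetime(2024, 5, 15)
samples = [
    (day.replace(hour=8, minute=30), 95.0),   # outside the effective period
    (day.replace(hour=9, minute=0), 90.0),    # first breach
    (day.replace(hour=9, minute=30), 92.0),   # still muted
    (day.replace(hour=10, minute=0), 91.0),   # mute interval elapsed
    (day.replace(hour=10, minute=30), 70.0),  # resolved
    (day.replace(hour=10, minute=45), 85.0),  # new breach
]
sent = notification_times(rule, samples)
print([ts.strftime("%H:%M") for ts in sent])  # ['09:00', '10:00', '10:45']
```

In this model, a longer Mute for value reduces repeated notifications for a sustained breach, while the effective period limits when data is evaluated in the first place.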

Subscribe to metrics

CloudMonitor provides a comprehensive API service that lets you subscribe to resource quota metrics. You can use this service to build your own monitoring systems and dashboards. For more information, see Cloud Service Monitoring API reference.

| CloudMonitor API | Overview |
| --- | --- |
| DescribeMetricLast | Queries the latest monitoring data of a metric. |
| DescribeMetricList | Queries monitoring data of a metric for a cloud service. |
| DescribeMetricData | Queries monitoring data of a metric for a cloud service. |
| DescribeMetricMetaList | Queries details of metrics available in CloudMonitor. |
| DescribeProjectMeta | Queries cloud services that support time series metrics in CloudMonitor. |
| DescribeMetricTop | Queries the latest monitoring data of a metric for a cloud service, sorted by value. |

The following example shows how to use the DescribeMetricList API operation to query metric data.

  1. Go to the PAI-Quota metrics page.

  2. In the row of the target metric, choose Actions > Get Metric Data.

  3. In OpenAPI Explorer, configure the following key parameters and leave the others at their default settings. For more information about the parameters, see DescribeMetricList.

    | Parameter | Description |
    | --- | --- |
    | Namespace | Set this to acs_pai_quota. |
    | MetricName | The name of the metric to query. For example, QUOTA_CPU_REQUEST. |
    | StartTime | The start of the time range for the query. For example, 2024-05-15 00:00:00. |
    | EndTime | The end of the time range for the query. For example, 2024-05-28 00:00:00. |

    Note: The time range between StartTime and EndTime cannot exceed 31 days.

  4. After you configure the parameters, click Initiate Call to view the metric data for the specified time range.
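If you prepare the same request from a script instead of OpenAPI Explorer, it can help to validate the key parameters before sending them. The following is a minimal sketch that only builds and checks the parameter set from the table above. The helper name `build_metric_list_params` is hypothetical, and the actual API call (for example, through an Alibaba Cloud SDK) is deliberately left out.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"   # format used in the examples above
MAX_RANGE = timedelta(days=31)      # limit stated in the note above


def build_metric_list_params(metric_name, start_time, end_time,
                             namespace="acs_pai_quota"):
    """Return the key DescribeMetricList parameters after checking the time range."""
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    if end <= start:
        raise ValueError("EndTime must be later than StartTime")
    if end - start > MAX_RANGE:
        raise ValueError("The time range cannot exceed 31 days")
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "StartTime": start_time,
        "EndTime": end_time,
    }


params = build_metric_list_params(
    "QUOTA_CPU_REQUEST", "2024-05-15 00:00:00", "2024-05-28 00:00:00"
)
print(params)
```

The resulting dictionary mirrors the fields entered in OpenAPI Explorer. All other request parameters are assumed to keep their default values.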

Use ARMS

Application Real-Time Monitoring Service (ARMS) is an Alibaba Cloud-native observability platform. With ARMS, you can create a custom Grafana dashboard for PAI-Quota and configure Prometheus alert rules to monitor its metric data. For more information, see What is ARMS?.

Billing

Fees may be incurred when you use ARMS. For detailed billing information, see ARMS billing.

Integrate monitoring data

Follow these steps:

  1. Log on to the ARMS console.

  2. In the left-side navigation pane, click Integration Center.

  3. On the Integration Center page, click the Artificial Intelligence tab on the left, and then click Alibaba Cloud PAI-Quota Service.

  4. (Optional) In the panel that appears, you can preview the monitoring dashboard and review the collected metrics and alert rule templates.

    • Preview: Click the Preview tab to view the metrics dashboard.

    • Collect metrics: Click the Collect Metrics tab to view the list of collected metrics.

    • Alert rule template: Click the Alert Rule Template tab to view the predefined alert rule templates.

  5. On the Start integration tab, configure the following parameters and click OK.

    | Parameter | Description |
    | --- | --- |
    | Select region for data storage | Select the region where you want to store the data. |
    | Integration name | Configure an integration name for the service as prompted on the console. |

    Integrating the PAI-Quota monitoring data takes about 1 to 2 minutes.

  6. After the integration is complete, you can click Integration Management to view details of the integrated environment.

View the Grafana dashboard

  1. Log on to the ARMS console. In the left-side navigation pane, choose Integration Management. On the Integrated Environments > Cloud Service Environment tab, click the name of the environment.

  2. On the Component Management tab, click Dashboard in the Addon Type section to view the built-in Grafana dashboard.

  3. Click the dashboard name to view the monitoring dashboard.

Configure Prometheus alerts

You can configure monitoring alerts by using Prometheus. Follow these steps:

  1. Log on to the ARMS console. In the left-side navigation pane, choose Integration Management. On the Integrated Environments > Cloud Service Environment tab, click the name of the environment.

  2. On the Component Management page, click Alert Rule in the Addon Type section to view the built-in alert rules.

  3. The built-in alert rules generate alert events but do not send notifications. You can configure notifications to be sent to email or other platforms by using one of the following two methods:

    • Configure a notification policy to define matching rules for alert events. When a rule is triggered, the system sends an alert to the specified contact using your chosen method. For more information, see Notification policies.

    • Edit an alert rule to configure its notification method.

      On the page for editing a Prometheus alert rule, you can also customize alert conditions, duration, content, and notifications. For detailed configuration information, see Create a Prometheus alert rule.
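The duration setting of a Prometheus alert rule means that the condition must hold continuously for the configured time before the alert fires. The sketch below models that behavior on a list of samples. It is a simplified illustration rather than the ARMS or Prometheus implementation: the function name `first_firing_time`, the threshold, and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta


def first_firing_time(samples, threshold, duration):
    """Return the first timestamp at which the alert would fire, or None.

    The condition (value > threshold) must hold at every sample for at
    least `duration`; any sample below the threshold resets the wait.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts          # condition just became true
            if ts - pending_since >= duration:
                return ts                   # held long enough: alert fires
        else:
            pending_since = None            # condition broken: start over
    return None


start = datetime(2024, 5, 15, 12, 0)
values = [95, 96, 80, 91, 92, 93, 94, 95, 96]   # one sample per minute
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]

fired_at = first_firing_time(samples, threshold=90, duration=timedelta(minutes=5))
print(fired_at)  # 2024-05-15 12:08:00
```

In this example, the dip at 12:02 resets the wait, so the alert fires at 12:08 rather than 12:05. A longer duration filters out short spikes at the cost of slower detection.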