You can view the capacity and performance information of a CPFS for Lingjun file system to understand its storage usage, read/write throughput, and read/write IOPS. By setting alert rules for important metrics, you can receive prompt notifications about exceptions and handle them quickly. This topic describes the metrics that CPFS for Lingjun supports and how to configure alert rules for them.
Background information
CloudMonitor is a service that monitors Alibaba Cloud resources and internet applications. You can use CloudMonitor to monitor metrics of various cloud resources and set alerts for specific metrics. This gives you a complete picture of your resource usage and application status on Alibaba Cloud and lets you handle faults promptly so that your services keep running smoothly. For more information, see What is CloudMonitor?
Retention policy of monitoring data
Monitoring data is retained for 90 days. After the retention period expires, the monitoring data is automatically cleared. The retention period starts when data is generated.
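The retention rule above can be sketched as a small date calculation. This is only an illustration: the function names are hypothetical, and whether a data point is cleared exactly at the 90-day mark or shortly afterwards is an assumption, not documented behavior.

```python
from datetime import datetime, timedelta, timezone

# Retention period stated in this topic.
RETENTION = timedelta(days=90)


def expires_at(generated_at: datetime) -> datetime:
    """Return the time at which a data point leaves the retention window.

    The window starts when the data point is generated.
    """
    return generated_at + RETENTION


def is_retained(generated_at: datetime, now: datetime) -> bool:
    """Return True while the data point is still inside the retention window.

    Assumption: the boundary itself counts as expired.
    """
    return now < expires_at(generated_at)


generated = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expires_at(generated).date())
print(is_retained(generated, datetime(2024, 2, 1, tzinfo=timezone.utc)))
```

If you need monitoring data for longer than the retention period, export it before the window closes.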
Monitoring metrics
CPFS for Lingjun supports comprehensive monitoring of file system capacity, instance performance, and client performance through CloudMonitor. Two sets of monitoring metrics are available: a new version (recommended) and an old version. The new metrics address issues in the old version, such as inconsistent naming and unclear structure, and offer improved usability and maintainability.
New customers: You can use the new metrics directly.
Existing customers: You can continue to use the old metrics to ensure business continuity. However, we recommend that you gradually migrate to the new version.
If you are an existing customer who wants to switch to the new metrics, you must first test them in a test environment.
New version metrics (recommended)
The new monitoring metrics are currently available in the following region: China (Beijing).
Capacity monitoring
Type | Metric | Metric name | Unit | Description |
File system - Standard | BmStdCapacity | Total storage capacity of a standard CPFS for Lingjun file system | Bytes (B) | The total storage space of the file system. |
File system - Standard | BmStdCapacityUsed | Data usage of a standard CPFS for Lingjun file system | Bytes (B) | The amount of storage space that is currently used by the file system. |
File system - Standard | BmStdInodeLimit | Maximum number of files in a standard CPFS for Lingjun file system | Count | The maximum total number of files and directories that the file system can hold. |
File system - Standard | BmStdInodeAlloc | Number of allocated files in a standard CPFS for Lingjun file system | Count | The total number of files and directories that are currently allocated (created) in the file system. |
File system - Standard | BmStdInodeUsed | Number of used files in a standard CPFS for Lingjun file system | Count | The total number of files and directories that are currently used in the file system. |
File system - Large | BmLargeCapacity | Total storage capacity of a large CPFS for Lingjun file system | Bytes (B) | The total storage space of the file system. |
File system - Large | BmLargeCapacityUsed | Data usage of a large CPFS for Lingjun file system | Bytes (B) | The amount of storage space that is currently used by the file system. |
File system - Large | BmLargeInodeLimit | Maximum number of files in a large CPFS for Lingjun file system | Count | The maximum total number of files and directories that the file system can hold. |
File system - Large | BmLargeInodeAlloc | Number of allocated files in a large CPFS for Lingjun file system | Count | The total number of files and directories that are currently allocated (created) in the file system. |
File system - Large | BmLargeInodeUsed | Number of used files in a large CPFS for Lingjun file system | Count | The total number of files and directories that are currently used in the file system. |
Fileset - Standard | BmStdFsetCapacityLimit | Capacity quota of a standard CPFS for Lingjun fileset | Bytes (B) | The maximum capacity quota set for a single fileset. |
Fileset - Standard | BmStdFsetCapacityUsed | Used capacity of a standard CPFS for Lingjun fileset | Bytes (B) | The capacity that is currently used by a single fileset. |
Fileset - Standard | BmStdFsetInodeLimit | File count quota of a standard CPFS for Lingjun fileset | Count | The maximum quota for the number of files and directories set for a single fileset. |
Fileset - Standard | BmStdFsetInodeAlloc | Number of pre-allocated files in a standard CPFS for Lingjun fileset | Count | The total number of files and directories that are currently pre-allocated for a single fileset. |
Fileset - Standard | BmStdFsetInodeUsed | Number of used files in a standard CPFS for Lingjun fileset | Count | The number of files and directories that are currently used by a single fileset. |
Fileset - Large | BmLargeFsetCapacityLimit | Capacity quota of a large CPFS for Lingjun fileset | Bytes (B) | The maximum capacity quota set for a single fileset. |
Fileset - Large | BmLargeFsetCapacityUsed | Used capacity of a large CPFS for Lingjun fileset | Bytes (B) | The capacity that is currently used by a single fileset. |
Fileset - Large | BmLargeFsetInodeLimit | File count quota of a large CPFS for Lingjun fileset | Count | The maximum total number of files and directories that can be held in a single fileset. |
Fileset - Large | BmLargeFsetInodeAlloc | Number of pre-allocated files in a large CPFS for Lingjun fileset | Count | The total number of files and directories that are currently allocated (created) for a single fileset. |
Fileset - Large | BmLargeFsetInodeUsed | Number of used files in a large CPFS for Lingjun fileset | Count | The total number of files and directories that are currently used by a single fileset. |
Note: Large-specification file systems are available only to specific users. If you do not use a large-specification file system, ignore the File system - Large and Fileset - Large metrics.
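Capacity and inode metrics are usually read in pairs: a used value against its limit. The following minimal sketch shows how such pairs could be turned into usage percentages and a readable size. The helper names and the sample values are hypothetical and used only for illustration; the metric names are taken from the capacity table.

```python
def usage_percent(used: float, limit: float) -> float:
    """Return used/limit as a percentage."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return 100.0 * used / limit


def format_bytes(num_bytes: float) -> str:
    """Render a byte count with binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024


# Hypothetical latest data points for one standard file system.
sample = {
    "BmStdCapacity": 100 * 1024**4,
    "BmStdCapacityUsed": 25 * 1024**4,
    "BmStdInodeLimit": 1_000_000,
    "BmStdInodeUsed": 400_000,
}

capacity_pct = usage_percent(sample["BmStdCapacityUsed"], sample["BmStdCapacity"])
inode_pct = usage_percent(sample["BmStdInodeUsed"], sample["BmStdInodeLimit"])
print(format_bytes(sample["BmStdCapacityUsed"]), capacity_pct, inode_pct)
```

Tracking both percentages is useful because a file system can run out of inodes while storage space is still available.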
Performance monitoring
Type | Metric | Metric name | Unit | Description |
File system - Standard | BmStdReadThroughput | Read throughput of a standard CPFS for Lingjun file system | Bytes/s | The average read throughput of the file system in bytes per second during a statistical period. |
File system - Standard | BmStdWriteThroughput | Write throughput of a standard CPFS for Lingjun file system | Bytes/s | The average write throughput of the file system in bytes per second during a statistical period. |
File system - Standard | BmStdReadIops | Read IOPS of a standard CPFS for Lingjun file system | Count/s (IOPS) | The average number of read operations per second for the file system during a statistical period. |
File system - Standard | BmStdWriteIops | Write IOPS of a standard CPFS for Lingjun file system | Count/s (IOPS) | The average number of write operations per second for the file system during a statistical period. |
File system - Standard | BmStdReadLatency | Read latency of a standard CPFS for Lingjun file system | ms | The average read latency of the file system during a statistical period. |
File system - Standard | BmStdWriteLatency | Write latency of a standard CPFS for Lingjun file system | ms | The average write latency of the file system during a statistical period. |
File system - Standard | BmStdMetaQps | Metadata QPS of a standard CPFS for Lingjun file system | Count/s (IOPS) | The average number of metadata requests per second for the file system during a statistical period. |
File system - Standard | BmStdMetaLatency | Metadata latency of a standard CPFS for Lingjun file system | ms | The average latency of metadata operations for the file system during a statistical period. |
File system - Large | BmLargeReadThroughput | Read throughput of a large CPFS for Lingjun file system | Bytes/s | The average read throughput of the file system in bytes per second during a statistical period. |
File system - Large | BmLargeWriteThroughput | Write throughput of a large CPFS for Lingjun file system | Bytes/s | The average write throughput of the file system in bytes per second during a statistical period. |
File system - Large | BmLargeReadIops | Read IOPS of a large CPFS for Lingjun file system | Count/s (IOPS) | The average number of read operations per second for the file system during a statistical period. |
File system - Large | BmLargeWriteIops | Write IOPS of a large CPFS for Lingjun file system | Count/s (IOPS) | The average number of write operations per second for the file system during a statistical period. |
File system - Large | BmLargeReadLatency | Read latency of a large CPFS for Lingjun file system | ms | The average read latency of the file system during a statistical period. |
File system - Large | BmLargeWriteLatency | Write latency of a large CPFS for Lingjun file system | ms | The average write latency of the file system during a statistical period. |
File system - Large | BmLargeMetaQps | Metadata QPS of a large CPFS for Lingjun file system | Count/s (IOPS) | The average number of metadata requests per second for the file system during a statistical period. |
File system - Large | BmLargeMetaLatency | Metadata latency of a large CPFS for Lingjun file system | Microsecond (μs) | The average latency of metadata operations for the file system during a statistical period. |
Client | ClientReadThroughput | Client read throughput of CPFS for Lingjun | Bytes/s | The average read throughput in bytes per second for the client during a statistical period. |
Client | ClientWriteThroughput | Client write throughput of CPFS for Lingjun | Bytes/s | The average write throughput in bytes per second for the client during a statistical period. |
Client | ClientReadIops | Client read IOPS of CPFS for Lingjun | Count/s (IOPS) | The average number of read operations per second for the client during a statistical period. |
Client | ClientWriteIops | Client write IOPS of CPFS for Lingjun | Count/s (IOPS) | The average number of write operations per second for the client during a statistical period. |
Client | ClientReadLatency | Average client read latency of CPFS for Lingjun | Microsecond (μs) | The average read latency for the client during a statistical period. |
Client | ClientWriteLatency | Average client write latency of CPFS for Lingjun | Microsecond (μs) | The average write latency for the client during a statistical period. |
Client | ClientMetaLatency | Client metadata latency of CPFS for Lingjun | ms | The average latency for a client to complete a single metadata operation. |
Client | ClientMetaQps | Client metadata QPS of CPFS for Lingjun | Count/s (IOPS) | The average number of metadata requests per second for the client during a statistical period. |
Connections | VpcClientCount | Number of VPC clients of CPFS for Lingjun | Count | The total number of clients connected to the file system through a VPC. |
Connections | RdmaClientCount | Number of RDMA clients of CPFS for Lingjun | Count | The total number of clients connected to the file system through RDMA. |
Note: Large-specification file systems are available only to specific users. If you do not use a large-specification file system, ignore the File system - Large metrics.
The elastic file client is a client installed by the CPFS team on compute nodes. It connects the compute nodes to the CPFS for Lingjun file system.
You can view client performance only in the CloudMonitor console or by calling CloudMonitor API operations. For more information, see View CPFS performance monitoring.
When using a CPFS for Lingjun file system on ECS or PAI Lingjun AI Computing Service (single-tenant) resources, the hostname is the hostname of the node.
When using a CPFS for Lingjun file system on PAI general computing resources or Lingjun resources, the hostname is the pod ID of the task.
For more information about the new monitoring metrics, see CloudMonitor Metric Query.
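The performance metrics above do not all share one latency unit: some are reported in milliseconds and others in microseconds. Before you compare values or plot them together, it helps to normalize them. The following sketch assumes the units listed in the performance table; the helper names and the lookup table are hypothetical, and only a few metrics are shown.

```python
# Latency units as listed in the performance table ("us" means microseconds).
LATENCY_UNITS = {
    "BmStdReadLatency": "ms",
    "BmStdWriteLatency": "ms",
    "BmStdMetaLatency": "ms",
    "ClientReadLatency": "us",
    "ClientWriteLatency": "us",
    "ClientMetaLatency": "ms",
}


def latency_ms(metric: str, value: float) -> float:
    """Convert a latency data point to milliseconds based on its listed unit."""
    unit = LATENCY_UNITS[metric]
    if unit == "us":
        return value / 1000.0
    return float(value)


def throughput_mib_s(bytes_per_second: float) -> float:
    """Convert a throughput data point from Bytes/s to MiB/s."""
    return bytes_per_second / (1024 * 1024)


print(latency_ms("ClientReadLatency", 1500))  # microseconds to milliseconds
print(latency_ms("BmStdReadLatency", 2))      # already in milliseconds
print(throughput_mib_s(1048576))              # Bytes/s to MiB/s
```

Keeping the unit lookup in one place makes it easier to adjust if the unit shown in the console for a metric differs from the table.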
Old version metrics
Capacity monitoring
Type | Metric | Metric name | Unit | Description |
File system | CPFSCapacity | Total storage space | Bytes | The total storage space of the file system during a statistical period. |
File system | CPFSCapacityUsed | Data volume | Bytes | The amount of data that is actually used by the file system during a statistical period. |
File system | CPFSInode Limit | Maximum number of files | Count | The maximum number of files that can be used by the file system during a statistical period. |
File system | CPFSInode Alloc | Number of allocated files | Count | The number of files that are allocated by the file system during a statistical period. |
File system | CPFSInode Used | Number of used files | Count | The number of files that are used by the file system during a statistical period. |
Fileset | BMCPFSFsetCapacityLimit | Fileset capacity quota | Bytes | The maximum storage space that a fileset can use to write data. After the quota is reached, no more data can be written. |
Fileset | BMCPFSFsetCapacityUsed | Fileset used capacity | Bytes | The storage space that is actually used by the fileset. |
Fileset | BMCPFSFsetInodeLimit | Fileset file count quota | Count | The maximum number of files and directories that a fileset can contain. After the quota is reached, no more files or directories can be created. |
Fileset | BMCPFSFsetInodeUsed | Number of files used by fileset | Count | The number of files that are actually used by the fileset. |
Performance monitoring
Type | Metric | Metric name | Unit | Description |
File system | ThruputRead | Read throughput | Bytes/s | The average read throughput of the file system in bytes per second during a statistical period. |
File system | ThruputWrite | Write throughput | Bytes/s | The average write throughput of the file system in bytes per second during a statistical period. |
File system | IopsRead | Read IOPS | Count/s | The average number of read operations per second for the file system during a statistical period. |
File system | IopsWrite | Write IOPS | Count/s | The average number of write operations per second for the file system during a statistical period. |
Dataflow | ThroughputImport | Import throughput | Bytes/s | The average throughput in bytes per second for a dataflow import task during a statistical period. |
Dataflow | ThroughputExport | Export throughput | Bytes/s | The average throughput in bytes per second for a dataflow export task during a statistical period. |
Dataflow | QPSImportMeta | Import metadata QPS | Count/s | The average number of metadata requests per second for a dataflow import task during a statistical period. |
Dataflow | QPSExportMeta | Export metadata QPS | Count/s | The average number of metadata requests per second for a dataflow export task during a statistical period. |
Dataflow | IOPSImport | Import IOPS | Count/s | The average number of I/O operations per second for a dataflow import task during a statistical period. |
Dataflow | IOPSEXport | Export IOPS | Count/s | The average number of I/O operations per second for a dataflow export task during a statistical period. |
Dataflow | LatencyImport | Import latency | Microsecond (μs) | The average latency of a dataflow import task during a statistical period. |
Dataflow | LatencyExport | Export latency | Microsecond (μs) | The average latency of a dataflow export task during a statistical period. |
Client | ClientReadIops | Client read IOPS | Count/s | The average number of read operations per second for the client during a statistical period. |
Client | ClientWriteIops | Client write IOPS | Count/s | The average number of write operations per second for the client during a statistical period. |
Client | ClientReadLatency | Client average read latency | Microsecond (μs) | The average read latency for the client during a statistical period. |
Client | ClientWriteLatency | Client average write latency | Microsecond (μs) | The average write latency for the client during a statistical period. |
Client | ClientReadThroughput | Client read throughput | Bytes/s | The average read throughput in bytes per second for the client during a statistical period. |
Client | ClientWriteThroughput | Client write throughput | Bytes/s | The average write throughput in bytes per second for the client during a statistical period. |
The elastic file client is a client installed by the CPFS team on compute nodes. It connects the compute nodes to the CPFS for Lingjun file system.
You can view client performance only in the CloudMonitor console or by calling CloudMonitor API operations. For more information, see View CPFS performance monitoring.
When using a CPFS for Lingjun file system on ECS or PAI Lingjun AI Computing Service (single-tenant) resources, the hostname is the hostname of the node.
When using a CPFS for Lingjun file system on PAI general computing resources or Lingjun resources, the hostname is the pod ID of the task.
For more information about the old monitoring metrics, see CloudMonitor Metric Query.
Alert rule description
In the CloudMonitor console, you can set alert rules for different metrics. If a metric for a resource meets the specified alert condition, CloudMonitor automatically sends an alert notification. The following table describes the alert levels, notification mechanisms, and alert conditions.
Alert level | Notification mechanism | Alert condition |
Critical | Phone call, text message, email, and DingTalk Robot | The average value of the metric meets the specified judgment condition for N consecutive statistical periods. Set the value of N based on the alert level. |
Warning | Text message, email, and DingTalk Robot | Same as Critical. |
Info | Email and DingTalk Robot | Same as Critical. |
Note: The alert condition varies based on the selected metric type. The condition displayed in the console prevails.
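The "N consecutive statistical periods" condition can be sketched as follows. This is an illustration of the rule as described in the table, not CloudMonitor's implementation: the function name, the sample averages, the 80 threshold, and the value of N are all hypothetical.

```python
from typing import Callable, Sequence


def should_alert(
    period_averages: Sequence[float],
    condition: Callable[[float], bool],
    n: int,
) -> bool:
    """Return True if the last n period averages all meet the condition.

    period_averages holds one average value per statistical period,
    ordered from oldest to newest.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if len(period_averages) < n:
        return False  # not enough periods observed yet
    return all(condition(value) for value in period_averages[-n:])


def over_80(value: float) -> bool:
    """Hypothetical judgment condition: average usage of at least 80."""
    return value >= 80


print(should_alert([70, 85, 90, 95], over_80, n=3))  # last three all meet it
print(should_alert([85, 90, 70, 95], over_80, n=3))  # streak broken by 70
```

A larger N makes a rule less sensitive to short spikes, at the cost of a later notification.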