
Container Service for Kubernetes:Use the observability features of CNFS to identify issues caused by I/O operations on the NAS client

Last Updated: Jun 25, 2023

You can analyze the data displayed on Container Network File System (CNFS) dashboards to identify issues caused by I/O operations on clients. For example, frequent I/O operations on clients may consume a large amount of bandwidth. This topic describes how to use the observability features of CNFS to identify such issues.


Prerequisites

  • A Container Service for Kubernetes (ACK) managed cluster is created. The Kubernetes version must be later than 1.20 and the Container Storage Interface (CSI) plug-in must be installed. For more information, see Create an ACK managed cluster.

  • The versions of csi-plugin and csi-provisioner are 1.24.9-74f8490-aliyun or later. For more information about how to update csi-plugin and csi-provisioner, see Install and upgrade the CSI plug-in.

  • An Apsara File Storage NAS (NAS) file system is mounted by using the CNFS client. For more information, see Enable the distributed caching feature of the CNFS client.
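To confirm that the installed CSI components satisfy the version requirement above, you can compare the reported version string against the minimum. The following is a minimal sketch, assuming the version string follows the 1.24.9-74f8490-aliyun pattern shown in the prerequisites; it is not an official tool.

```python
def meets_minimum(version: str, minimum: str = "1.24.9") -> bool:
    # CSI component versions look like "1.24.9-74f8490-aliyun" (assumption
    # based on the version pattern above); only the leading dotted numeric
    # part is compared, component-wise.
    numeric = version.split("-")[0]
    return tuple(map(int, numeric.split("."))) >= tuple(map(int, minimum.split(".")))

print(meets_minimum("1.24.9-74f8490-aliyun"))   # True
print(meets_minimum("1.22.10-abc1234-aliyun"))  # False
```

You can feed this function the image tag of the csi-plugin and csi-provisioner pods in your cluster to decide whether an upgrade is required.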

Storage dashboards

| Dashboard | Description |
| --- | --- |
| Frontend Storage IO Monitoring (Cluster Level) | Shows the key metrics about access to the CNFS client. You can filter NAS file systems by ID. |
| Backend Storage IO Monitoring (Cluster Level) | Shows the key metrics about access to the NAS file system. You can filter NAS file systems by ID. |
| Container Storage IO Monitoring (Cluster Level) | Shows the key metrics about the top N pods that are accessed. |
| Pod IO Monitoring (Pod Level) | Shows the key metrics about each pod that is accessed. You can filter pods by name. |

The metrics on volume dashboards are custom metrics. For more information about the billing rules of custom metrics, see Billing.

View storage dashboards

  1. Log on to the ACK console and click Clusters in the left-side navigation pane.

  2. On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Operations > Prometheus Monitoring.

  3. On the Prometheus Monitoring page, click the Storage Monitoring tab.

    • Click Frontend Storage IO Monitoring (Cluster Level) to view the key metrics about access to the CNFS client.

    • Click Backend Storage IO Monitoring (Cluster Level) to view the key metrics about access to the NAS file system.

    • Click Container Storage IO Monitoring (Cluster Level) to view the key metrics about the top N pods that are accessed.

    • Click Pod IO Monitoring (Pod Level) to view the key metrics about each pod that is accessed.

Identify the issues caused by frequent I/O operations of pods

The following examples analyze read operations on a NAS volume. You can perform similar steps to analyze read operations on CPFS volumes.

Issue 1: How do I view the metrics about read operations on a NAS volume?

Go to the Frontend Storage IO Monitoring (Cluster Level) dashboard and set the following filters:

  • client_name: Select eac.

  • backend_storage: Select nas.

  • bucket_name: Select the ID of the NAS file system.


The following table describes the aggregated I/O information about the selected NAS volume.

| IOPS | Throughput | POSIX requests |
| --- | --- | --- |
| 13 k/s | 50 MB/s | 14 k count/s |
| 4 k/s | 18 MB/s | 4 k count/s |
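The panels behind these aggregated numbers are driven by Prometheus. If you want to pull the same data programmatically instead of through the dashboard, you can call the standard Prometheus HTTP API. The sketch below builds an instant-query URL; the metric name cnfs_read_bytes_total is a hypothetical placeholder, so copy the real PromQL expression from the panel you are interested in.

```python
from urllib.parse import urlencode

# Hypothetical metric name -- open the panel in Grafana and copy its actual
# PromQL expression. Only the /api/v1/query endpoint itself is standard.
PROMQL = 'sum(rate(cnfs_read_bytes_total{backend_storage="nas"}[1m]))'

def instant_query_url(base_url: str, promql: str) -> str:
    # Prometheus HTTP API instant-query endpoint; urlencode handles the
    # braces and quotes inside the PromQL expression.
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

print(instant_query_url("http://localhost:9090", PROMQL))
```

The returned JSON contains a `data.result` array whose sample values correspond to the numbers shown on the dashboard.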

Issue 2: Which applications and volumes may slow down the system when their I/O operations become frequent?

If the number of read requests to a persistent volume claim (PVC) that is mounted to your application sharply increases, the application may be throttled or stop responding. To identify the issue, perform the following steps:

  1. Identify the pods that are frequently accessed.

    Go to the Container Storage IO Monitoring (Cluster Level) dashboard. In the TopN_Pod_IOPS(IO/s) and TopN_Pod_Throughput panels, sort the pods by the read column to find the pod with the most frequent I/O operations and the pod with the highest throughput.

    The figure shows that pods whose names start with eac-read-test-sts have frequent I/O operations and high throughput, and the eac-read-test-sts-0 pod has the most frequent I/O operations and highest throughput.

  2. Identify the volumes that are frequently accessed.

    Go to the Container Storage IO Monitoring (Cluster Level) dashboard. In the TopN_PV_IOPS(IO/s) and TopN_PV_Throughput panels, sort the persistent volumes (PVs) by the read column to find the PV with the most frequent I/O operations and the PV with the highest throughput.

    The figure shows that the eac-read-test-sts-0 pod to which the nas-e5e6f5dd-35a5-4808-89a8-94ac4fbe6534 PV is mounted has the most frequent I/O operations and highest throughput.

  3. Identify the pods that have frequent Portable Operating System Interface (POSIX) operations.

    Go to the Pod IO Monitoring (Pod Level) dashboard and select the eac-read-test-sts-0 pod. In the Throughput, IOPS, and POSIX Operation(count/s) panels, you can view the throughput and the number of POSIX operations of the pod.

    The figure shows that the eac-read-test-sts-0 pod to which the nas-e5e6f5dd-35a5-4808-89a8-94ac4fbe6534 PV is mounted has the highest IOPS and has about 7,920 Read POSIX operations per second.

  4. You can modify the pod configurations to resolve the issues caused by frequent I/O operations.

    1. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Pods in the left-side navigation pane.

    2. On the Pods page, click the StatefulSet named eac-read-test-sts. On the details page, you can obtain the image used to deploy the application and modify the application configurations.
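The TopN panels used in the steps above roughly correspond to PromQL topk queries. As a hedged sketch of the kind of expression involved (the metric name cnfs_pod_ops_total and the op label are hypothetical; copy the real expression from the panel), a query builder might look like this:

```python
def topn_pod_query(n: int, metric: str, op: str = "read") -> str:
    # Builds a topk query that ranks pods by the per-second rate of the
    # given counter metric over a 1-minute window. The metric name and
    # the "op" label are assumptions, not the dashboard's actual schema.
    return f'topk({n}, sum by (pod) (rate({metric}{{op="{op}"}}[1m])))'

print(topn_pod_query(10, "cnfs_pod_ops_total"))
# topk(10, sum by (pod) (rate(cnfs_pod_ops_total{op="read"}[1m])))
```

Running such a query against the Prometheus backend returns the same ranking that the TopN_Pod panels display, which is useful for alerting or scripted triage.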

Issue 3: Which metadata requests may slow down the system when they become frequent?

If the number of read requests to the metadata in a NAS file system sharply increases, the application may be throttled or stop responding. To identify the issue, perform the following steps:

  1. Identify the applications that frequently access metadata.

    Go to the Frontend Storage IO Monitoring (Cluster Level) dashboard and set the following filters. Then, you can view the value of the readdir metric in the Aggregated POSIX Operation (count/s) panel.

    • client_name: Select eac.

    • backend_storage: Select nas.

    • bucket_name: Select the ID of the NAS file system.


    The figure shows that 4,480 readdir requests are sent to the eac client per second.

  2. Identify the volumes that are frequently accessed.

    Go to the Container Storage IO Monitoring (Cluster Level) dashboard. In the TopN_Pod_Meta_Operation and TopN_PV_Meta_Operation panels, use the rate filter to sort the readdir metric in descending order.

    The figure shows that the pod whose name starts with eac-test-ls-sts performs the most directory traversal (readdir) operations on the cnfs-eac-static-pv PV that is mounted to the pod.

  3. Identify the frequent I/O operations on the frequently accessed pods.

    Go to the Pod IO Monitoring (Pod Level) dashboard. Set Pod to eac-test-ls-sts-0 and then to eac-test-ls-sts-1 to view the I/O metrics of each pod in the POSIX Operation (count/s) panel.

  4. You can modify the pod configurations to resolve the issues caused by frequent metadata access.

    1. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Pods in the left-side navigation pane.

    2. On the Pods page, click the StatefulSet named eac-test-ls-sts. On the details page, you can obtain the image used to deploy the application and also modify the application configurations.
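If you scrape the readdir rate shown in the Aggregated POSIX Operation (count/s) panel, a spike like the 4,480 requests per second observed above can be flagged automatically. The following is an illustrative sketch; the 4,000/s threshold is an assumption you would tune for your workload, not a documented limit.

```python
def spike_indices(rates, threshold=4000.0):
    # rates: per-second readdir rates sampled from the dashboard panel.
    # Returns the sample positions where the rate exceeds the threshold,
    # so a sustained run of indices indicates a metadata hot spot.
    return [i for i, r in enumerate(rates) if r > threshold]

samples = [120.0, 150.0, 4480.0, 4390.0, 130.0]
print(spike_indices(samples))  # [2, 3]
```

A run of consecutive flagged indices is a signal to inspect the corresponding pod, as described in the steps above, before the application is throttled.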
