Dynamic volume provisioning automates on-demand storage for CPFS for Lingjun, eliminating manual persistent volume (PV) management. Because CPFS for Lingjun supports concurrent reads and writes across multiple pods, it is well suited for AI training and data analytics workloads that share code, configuration files, and intermediate computation results.
Limitations
Review these constraints before you begin. Violating them leads to mount failures or unrecoverable cluster state.
- Same hpn-zone required for VSC mounting: The node running the pod must be in the same hpn-zone as the CPFS for Lingjun file system instance.
- Node initialization: A Lingjun node must be associated with a CPFS for Lingjun file system during initialization. If this step was skipped, CSI mounting fails.
- One file system per pod: Do not mount multiple volumes from the same CPFS for Lingjun file system in a single pod, for example, multiple PVs created by a StorageClass containing the same `bmcpfsId`. The native protocol does not support mounting the same file system instance multiple times within a single pod, even to different subdirectories.
- Drain before taking a node offline: Before taking a Lingjun node offline due to failure, drain all pods from it. Skipping this step leaves behind unrecoverable pod resources and causes inconsistent cluster metadata.
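To make the one-file-system-per-pod rule concrete, the following is a hypothetical pod spec that violates it (the pod and PVC names are invented for illustration): both PVCs resolve to the same `bmcpfsId`, so the pod fails to start even though the mount paths differ.

```yaml
# Anti-pattern: both PVCs were provisioned from StorageClasses that
# reference the same CPFS for Lingjun file system (same bmcpfsId).
# The native protocol cannot mount one file system instance twice in
# a single pod, so this pod fails to start.
apiVersion: v1
kind: Pod
metadata:
  name: bad-double-mount              # hypothetical name
spec:
  containers:
    - name: app
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - name: vol-a
          mountPath: /data-a
        - name: vol-b
          mountPath: /data-b          # a different subdirectory does not help
  volumes:
    - name: vol-a
      persistentVolumeClaim:
        claimName: pvc-from-bmcpfs-x          # hypothetical PVC names, both
    - name: vol-b                             # backed by the same bmcpfsId
      persistentVolumeClaim:
        claimName: another-pvc-from-bmcpfs-x
```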
Prerequisites
Before you begin, make sure you have:
- A cluster running version 1.26 or later. To upgrade, see Manually upgrade a cluster.
- Nodes running Alibaba Cloud Linux 3.
- The following storage components installed and meeting the minimum version requirements. Go to the Add-ons page to check versions, install, or upgrade components.

| Component | Minimum version |
| --- | --- |
| CSI add-on (csi-plugin and csi-provisioner) | v1.33.1 |
| cnfs-nas-daemon add-on | 0.1.2 |
| bmcpfs-csi component (bmcpfs-csi-controller and bmcpfs-csi-node) | 1.35.1 |
Configure cnfs-nas-daemon resources
The cnfs-nas-daemon add-on manages Elastic File Client (EFC) processes. It consumes significant resources and directly affects storage performance. Set its resource configuration on the Add-ons page using these guidelines:
- CPU: Allocate 0.5 core per 1 Gb/s of bandwidth, plus 1 extra core for metadata management. Example: for a node with a 100 Gb/s NIC, set the CPU request to 100 × 0.5 + 1 = 51 cores.
- Memory: Set the memory request to 15% of the node's total memory. CPFS for Lingjun uses FUSE, so data caching and file metadata both consume memory.
After adjusting the configuration, scale resources up or down based on actual workload.
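The CPU guideline above can be sketched as a small calculation. The 100 Gb/s figure is the example from this section; `NIC_GBPS` is a placeholder you would replace with your node's actual NIC bandwidth.

```shell
# CPU request for cnfs-nas-daemon: 0.5 core per 1 Gb/s of NIC
# bandwidth, plus 1 core for metadata management.
NIC_GBPS=100                        # replace with your node's NIC bandwidth
CPU_CORES=$(( NIC_GBPS / 2 + 1 ))   # 0.5 core per Gb/s + 1 for metadata
echo "cnfs-nas-daemon CPU request: ${CPU_CORES} cores"
```

For the 100 Gb/s example this prints a request of 51 cores, matching the formula above.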
The cnfs-nas-daemon DaemonSet uses the OnDelete update strategy by default. After changing CPU or memory settings on the Add-ons page, manually delete the existing cnfs-nas-daemon pod on each node to trigger a rebuild and apply the new settings. Perform this during off-peak hours.
- Nodes without hot upgrade support: Deleting the cnfs-nas-daemon pod interrupts existing mounts. Application pods fail and require manual deletion; after deletion, they restart and recover automatically.
- Nodes with hot upgrade support: Application pods recover automatically after the cnfs-nas-daemon pod restarts. A node supports hot upgrades when all three conditions are met: kernel version 5.10.134-18 or later, bmcpfs-csi-controller and bmcpfs-csi-plugin versions 1.35.1 or later, and cnfs-nas-daemon version 0.1.9-compatible.1 or later.
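The kernel condition for hot upgrades can be checked with version-aware sorting. This is a sketch: the `KERNEL` value below is a made-up sample, and on a real node you would use `$(uname -r)` instead.

```shell
# Check whether a kernel version string meets the 5.10.134-18
# hot-upgrade threshold. sort -V orders version strings, so the
# threshold sorting first means the kernel is at least that version.
MIN="5.10.134-18"
KERNEL="5.10.134-18.al8.x86_64"   # sample value; use "$(uname -r)" on the node
LOWEST=$(printf '%s\n' "$MIN" "$KERNEL" | sort -V | head -n1)
if [ "$LOWEST" = "$MIN" ]; then
  echo "kernel meets hot-upgrade threshold"
else
  echo "kernel too old"
fi
```

The bmcpfs-csi and cnfs-nas-daemon version conditions are checked on the cluster's Add-ons page as described above.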
Step 1: Create a CPFS file system
1. Create a CPFS for Lingjun file system. See Create a CPFS for Lingjun file system. Record the file system ID.
2. (Optional) If you want to mount on non-Lingjun nodes, create a VPC mount target in the same VPC as your cluster nodes, and record the mount target domain name. The format is `cpfs-*-vpc-*.<Region>.cpfs.aliyuncs.com`. If all pods schedule to Lingjun nodes, VSC (Virtual Storage Controller) mounting is used by default and this step is not required.

Also review Limits for CPFS for Lingjun before proceeding.
Step 2: Create a StorageClass
Create a StorageClass object to define the storage template for dynamic provisioning.
1. Create sc.yaml with the following content:

   | Parameter | Required | Description |
   | --- | --- | --- |
   | `bmcpfsId` | Yes | CPFS for Lingjun file system ID. Format: `bmcpfs-xxxxxxxxx` or `cpfs-xxxxxxxxx`. |
   | `path` | No | Subdirectory within the file system. When specified, the volume is created under `{path}/{volumeName}/`. When omitted, the volume is created under `/{volumeName}/`. |
   | `allowVolumeExpansion` | No | Reserved parameter. The current version does not support dynamic expansion. |
   | `reclaimPolicy` | No | `Delete` (default): automatically deletes the fileset in the backend file system when you delete the PVC. `Retain`: keeps the fileset when you delete the PVC; you must clean up manually. Use `Retain` in production environments. |

   ```yaml
   apiVersion: storage.k8s.io/v1
   kind: StorageClass
   metadata:
     name: alicloud-bmcpfs-test
   provisioner: bmcpfsplugin.csi.alibabacloud.com
   parameters:
     # Required: CPFS for Lingjun file system ID (format: bmcpfs-xxxxxxxxx or cpfs-xxxxxxxxx)
     bmcpfsId: bmcpfs-29000z8xz3lf5nj*****
     # Optional: subdirectory within the file system; volume creates under {path}/{volumeName}/
     # path: "/shared"
   # Reserved parameter; the current version does not support dynamic expansion
   allowVolumeExpansion: true
   # Delete (default): automatically removes the fileset when the PVC is deleted
   # Retain (recommended for production): keeps the fileset; you must clean up manually
   reclaimPolicy: Delete
   ```
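Since this section recommends `Retain` for production, a production-oriented variant of the StorageClass might look like the following sketch. The name `alicloud-bmcpfs-prod` and the `/prod` subdirectory are placeholders, not values from this guide.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-bmcpfs-prod                # placeholder name
provisioner: bmcpfsplugin.csi.alibabacloud.com
parameters:
  bmcpfsId: bmcpfs-29000z8xz3lf5nj*****     # your file system ID
  path: "/prod"                             # optional subdirectory, placeholder
# Retain keeps the backend fileset when the PVC is deleted;
# you clean up the data manually, so accidental PVC deletion is recoverable.
reclaimPolicy: Retain
```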
2. Apply the StorageClass:

   ```shell
   kubectl apply -f sc.yaml
   ```

   The expected output is:

   ```
   storageclass.storage.k8s.io/alicloud-bmcpfs-test created
   ```
Step 3: Create a PVC
Applications request storage through a persistent volume claim (PVC), which references the StorageClass as a provisioning template.
1. Create pvc.yaml with the following content:

   | Parameter | Description |
   | --- | --- |
   | `accessModes` | Only `ReadWriteMany` is supported, allowing multiple pods to mount and read/write simultaneously. |
   | `storage` | Requested storage capacity. Supports units such as Gi and Ti. |
   | `volumeMode` | Only `Filesystem` is supported. |
   | `storageClassName` | The StorageClass to use. Specifying this field triggers dynamic volume creation. |

   ```yaml
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: bmcpfs-vsc
     namespace: default
   spec:
     accessModes:
       - ReadWriteMany            # Supports concurrent reads and writes across multiple pods
     resources:
       requests:
         storage: 10Ti            # Supports large-capacity storage (Ti level)
     volumeMode: Filesystem       # Only Filesystem is supported
     storageClassName: alicloud-bmcpfs-test  # Must match the StorageClass created in Step 2
   ```
2. Apply the PVC:

   ```shell
   kubectl apply -f pvc.yaml
   ```

3. Verify the PVC is bound:

   ```shell
   kubectl get pvc bmcpfs-vsc -n default
   ```

   The expected output is:

   ```
   NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
   bmcpfs-vsc   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   10Ti       RWX            alicloud-bmcpfs-test   30s
   ```

   When STATUS is Bound, the system has automatically created the corresponding PV. To confirm provisioning succeeded, run:

   ```shell
   kubectl describe pvc bmcpfs-vsc -n default
   ```

   In the Events section, look for a Provisioning succeeded message.
Step 4: Deploy a workload and mount the PVC
After the PVC is bound, deploy a workload that mounts the volume.
1. Create deploy.yaml with the following content:

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: cpfs-shared-example
   spec:
     replicas: 3  # Three replicas verify that shared storage works across multiple pods
     selector:
       matchLabels:
         app: cpfs-shared-app
     template:
       metadata:
         labels:
           app: cpfs-shared-app
       spec:
         tolerations:
           - key: node-role.alibabacloud.com/lingjun
             operator: Exists
             effect: NoSchedule
         # Optional: to pin all pods to a specific node, uncomment and set the node name
         # nodeName: cn-hangzhou.10.XX.XX.226
         containers:
           - name: app-container
             image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
             command:
               - /bin/sh
               - -c
               - >
                 echo "Data written by $(hostname)" > /data/$(hostname).txt &&
                 echo "Deployment is running, check shared data in /data." &&
                 sleep 3600
             volumeMounts:
               - name: pvc-cpfs
                 mountPath: /data  # Shared volume mounted at /data inside the container
         volumes:
           - name: pvc-cpfs
             persistentVolumeClaim:
               claimName: bmcpfs-vsc  # References the PVC created in Step 3
   ```
2. Apply the Deployment:

   ```shell
   kubectl apply -f deploy.yaml
   ```

   The expected output is:

   ```
   deployment.apps/cpfs-shared-example created
   ```
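The write pattern the Deployment uses can be simulated locally without a cluster: each replica writes one file named after its hostname into the shared directory, so three replicas produce three distinct files. The hostnames `pod-a`, `pod-b`, and `pod-c` below are stand-ins for the real replica names.

```shell
# Cluster-free simulation of the per-replica write pattern: with three
# replicas, /data (simulated here by a temp directory) ends up holding
# one file per pod hostname.
SHARED=$(mktemp -d)
for host in pod-a pod-b pod-c; do       # stand-ins for replica hostnames
  echo "Data written by $host" > "$SHARED/$host.txt"
done
FILE_COUNT=$(ls "$SHARED" | wc -l)
echo "files in shared dir: $FILE_COUNT"
```

On the real cluster, the equivalent check is to list /data from inside any one replica and confirm files from all three hostnames are visible.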
Clean up resources
To avoid unexpected costs and ensure data safety, delete resources in the following order.
1. Delete workloads: Stop all applications using the relevant PVCs. This unmounts the volumes.

   ```shell
   kubectl delete deployment <your-deployment-name>
   ```

2. Delete PVCs: The outcome depends on the reclaimPolicy of the StorageClass:

   - Retain (recommended): After deleting the PVC, the CPFS for Lingjun fileset and its data remain intact. Proceed to the next step to clean up the PV.
   - Delete: Deleting the PVC permanently deletes its bound PV and the backend fileset. This operation is irreversible.

   ```shell
   kubectl delete pvc <your-pvc-name>
   ```

3. Delete PVs (only when reclaimPolicy is Retain): After you delete the PVC, the PV transitions to Released status. Delete the PV to remove the resource definition from Kubernetes. This does not affect backend data.

   ```shell
   kubectl delete pv <your-pv-name>
   ```

4. (Optional) Delete the StorageClass: If you no longer need this storage configuration, delete the StorageClass. This does not affect already-created volumes.

   ```shell
   kubectl delete sc <your-sc-name>
   ```

5. Delete the CPFS for Lingjun file system: This permanently deletes all data on the file system, including data retained by the Retain policy. Confirm the file system has no remaining dependencies before proceeding. See Delete a file system.
Troubleshooting
PVC stays in Pending status
A PVC stuck in Pending means dynamic provisioning failed. Start with the PVC events — they usually identify the cause directly.
```shell
kubectl describe pvc <your-pvc-name> -n <your-namespace>
```
Check the Events section for warning messages. Common causes:
- StorageClass not found: The storageClassName field is incorrect, or the StorageClass does not exist.
- provisioning failed or failed to create fileset: There is an issue interacting with the backend storage. Continue with the steps below.
If the events point to a configuration issue, inspect the StorageClass and verify the CSI driver is registered:
```shell
# Check the StorageClass configuration
kubectl get storageclass <your-sc-name> -o yaml

# Verify the CSI driver is registered
kubectl get csidriver bmcpfsplugin.csi.alibabacloud.com
```
Confirm that:
- The provisioner field matches bmcpfsplugin.csi.alibabacloud.com.
- The bmcpfsId parameter is correctly set and the file system ID exists.
- If get csidriver returns no output or an error, the driver is not installed. Install bmcpfs-csi-controller, bmcpfs-csi-node, and cnfs-nas-daemon from the cluster's Add-ons page.
Pod stays in ContainerCreating or MountVolume.Setup failed
This error means the pod was scheduled to a node but the volume mount failed. Follow these steps to isolate the cause.
1. Check pod events for the specific failure:

   ```shell
   kubectl describe pod <pod-name> -n <your-namespace>
   ```

   In the Events section, look for Warning messages such as FailedMount or MountVolume.Setup failed.

2. Confirm the PVC is bound. Pods can only mount successfully bound volumes.

   ```shell
   kubectl get pvc <your-pvc-name>
   ```

   The STATUS must be Bound. If it is Pending, see PVC stays in Pending status.

3. If the PVC is bound, check the node-side CSI plugin logs for the lowest-level error:

   ```shell
   kubectl get pods -n kube-system -l app.kubernetes.io/name=bmcpfs-csi-driver \
     --field-selector spec.nodeName=<nodeName> \
     -o name | xargs kubectl logs -n kube-system -c csi-plugin
   ```

   These logs contain detailed error messages, including network connectivity issues, mount target permission errors, and underlying I/O errors.