Dynamic volume provisioning automates on-demand storage for CPFS for Lingjun, eliminating manual persistent volume (PV) management. Because CPFS for Lingjun supports concurrent reads and writes across multiple pods, it is well suited for AI training and data analytics workloads that share code, configuration files, and intermediate computation results.
Limitations
Review these constraints before you begin. Violating them leads to mount failures or unrecoverable cluster state.
- Same hpn-zone required for VSC mounting: The node running the pod must be in the same hpn-zone as the CPFS for Lingjun file system instance.
- Node initialization: A Lingjun node must be associated with a CPFS for Lingjun file system during initialization. If this step was skipped, CSI mounting fails.
- One file system per pod: Do not mount multiple volumes from the same CPFS for Lingjun file system in a single pod, for example, multiple PVs created by a StorageClass containing the same `bmcpfsId`. The native protocol does not support mounting the same file system instance multiple times within a single pod, even to different subdirectories.
- Drain before taking a node offline: Before taking a Lingjun node offline due to failure, drain all pods from it. Skipping this step leaves behind unrecoverable pod resources and causes inconsistent cluster metadata.
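To make the one-file-system-per-pod rule concrete, the following is a hypothetical pod spec that violates it (the pod and PVC names are invented for illustration): both PVCs resolve to the same `bmcpfsId`, so the pod fails to start even though the mount paths differ.

```yaml
# Anti-pattern: both PVCs were provisioned from StorageClasses that
# reference the same CPFS for Lingjun file system (same bmcpfsId).
# The native protocol cannot mount one file system instance twice in
# a single pod, so this pod fails to start.
apiVersion: v1
kind: Pod
metadata:
  name: bad-double-mount              # hypothetical name
spec:
  containers:
    - name: app
      image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
      volumeMounts:
        - name: vol-a
          mountPath: /data-a
        - name: vol-b
          mountPath: /data-b          # a different subdirectory does not help
  volumes:
    - name: vol-a
      persistentVolumeClaim:
        claimName: pvc-from-bmcpfs-x          # hypothetical PVC names, both
    - name: vol-b                             # backed by the same bmcpfsId
      persistentVolumeClaim:
        claimName: another-pvc-from-bmcpfs-x
```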
Prerequisites
Before you begin, make sure you have:
- A cluster running version 1.26 or later. To upgrade, see Manually upgrade a cluster.
- Nodes running Alibaba Cloud Linux 3.
- The following storage components installed and meeting the minimum version requirements. Go to the Add-ons page to check versions, install, or upgrade components.

| Component | Minimum version |
| --- | --- |
| CSI add-on (csi-plugin and csi-provisioner) | v1.33.1 |
| cnfs-nas-daemon add-on | 0.1.2 |
| bmcpfs-csi component (bmcpfs-csi-controller and bmcpfs-csi-node) | 1.35.1 |
Configure cnfs-nas-daemon resources
The cnfs-nas-daemon add-on manages Elastic File Client (EFC) processes. It consumes significant resources and directly affects storage performance. Set its resource configuration on the Add-ons page using these guidelines:
- CPU: Allocate 0.5 core per 1 Gb/s of bandwidth, plus 1 extra core for metadata management. Example: for a node with a 100 Gb/s NIC, set the CPU request to 100 × 0.5 + 1 = 51 cores.
- Memory: Set the memory request to 15% of the node's total memory. CPFS for Lingjun uses FUSE, so data caching and file metadata both consume memory.
After adjusting the configuration, scale resources up or down based on actual workload.
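The CPU guideline above can be sketched as a small calculation. The 100 Gb/s figure is the example from this section; `NIC_GBPS` is a placeholder you would replace with your node's actual NIC bandwidth.

```shell
# CPU request for cnfs-nas-daemon: 0.5 core per 1 Gb/s of NIC
# bandwidth, plus 1 core for metadata management.
NIC_GBPS=100                        # replace with your node's NIC bandwidth
CPU_CORES=$(( NIC_GBPS / 2 + 1 ))   # 0.5 core per Gb/s + 1 for metadata
echo "cnfs-nas-daemon CPU request: ${CPU_CORES} cores"
```

For the 100 Gb/s example this prints a request of 51 cores, matching the formula above.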
The cnfs-nas-daemon DaemonSet uses the OnDelete update strategy by default. After changing CPU or memory settings on the Add-ons page, manually delete the existing cnfs-nas-daemon pod on each node to trigger a rebuild and apply the new settings. Perform this during off-peak hours.
- Nodes without hot upgrade support: Deleting the cnfs-nas-daemon pod interrupts existing mounts. Application pods fail and require manual deletion; after deletion, they restart and recover automatically.
- Nodes with hot upgrade support: Application pods recover automatically after the cnfs-nas-daemon pod restarts. A node supports hot upgrades when all three conditions are met: kernel version 5.10.134-18 or later, bmcpfs-csi-controller and bmcpfs-csi-plugin versions 1.35.1 or later, and cnfs-nas-daemon version 0.1.9-compatible.1 or later.
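The kernel condition for hot upgrades can be checked with version-aware sorting. This is a sketch: the `KERNEL` value below is a made-up sample, and on a real node you would use `$(uname -r)` instead.

```shell
# Check whether a kernel version string meets the 5.10.134-18
# hot-upgrade threshold. sort -V orders version strings, so the
# threshold sorting first means the kernel is at least that version.
MIN="5.10.134-18"
KERNEL="5.10.134-18.al8.x86_64"   # sample value; use "$(uname -r)" on the node
LOWEST=$(printf '%s\n' "$MIN" "$KERNEL" | sort -V | head -n1)
if [ "$LOWEST" = "$MIN" ]; then
  echo "kernel meets hot-upgrade threshold"
else
  echo "kernel too old"
fi
```

The bmcpfs-csi and cnfs-nas-daemon version conditions are checked on the cluster's Add-ons page as described above.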
Step 1: Create a CPFS file system
1. Create a CPFS for Lingjun file system. See Create a CPFS for Lingjun file system. Record the file system ID.
2. (Optional) If you want to mount on non-Lingjun nodes, create a VPC mount target in the same VPC as your cluster nodes, and record the mount target domain name. The format is `cpfs-*-vpc-*.<Region>.cpfs.aliyuncs.com`. If all pods schedule to Lingjun nodes, VSC (Virtual Storage Controller) mounting is used by default and this step is not required.

Also review Limits for CPFS for Lingjun before proceeding.
Step 2: Create a StorageClass
Create a StorageClass object to define the storage template for dynamic provisioning.
1. Create sc.yaml with the following content:

   | Parameter | Required | Description |
   | --- | --- | --- |
   | `bmcpfsId` | Yes | CPFS for Lingjun file system ID. Format: `bmcpfs-xxxxxxxxx` or `cpfs-xxxxxxxxx`. |
   | `path` | No | Subdirectory within the file system. When specified, the volume is created under `{path}/{volumeName}/`. When omitted, the volume is created under `/{volumeName}/`. |
   | `allowVolumeExpansion` | No | Reserved parameter. The current version does not support dynamic expansion. |
   | `reclaimPolicy` | No | `Delete` (default): automatically deletes the fileset in the backend file system when you delete the PVC. `Retain`: keeps the fileset when you delete the PVC; you must clean up manually. Use `Retain` in production environments. |

   ```yaml
   apiVersion: storage.k8s.io/v1
   kind: StorageClass
   metadata:
     name: alicloud-bmcpfs-test
   provisioner: bmcpfsplugin.csi.alibabacloud.com
   parameters:
     # Required: CPFS for Lingjun file system ID (format: bmcpfs-xxxxxxxxx or cpfs-xxxxxxxxx)
     bmcpfsId: bmcpfs-29000z8xz3lf5nj*****
     # Optional: subdirectory within the file system; volume creates under {path}/{volumeName}/
     # path: "/shared"
   # Reserved parameter; the current version does not support dynamic expansion
   allowVolumeExpansion: true
   # Delete (default): automatically removes the fileset when the PVC is deleted
   # Retain (recommended for production): keeps the fileset; you must clean up manually
   reclaimPolicy: Delete
   ```
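Since this section recommends `Retain` for production, a production-oriented variant of the StorageClass might look like the following sketch. The name `alicloud-bmcpfs-prod` and the `/prod` subdirectory are placeholders, not values from this guide.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-bmcpfs-prod                # placeholder name
provisioner: bmcpfsplugin.csi.alibabacloud.com
parameters:
  bmcpfsId: bmcpfs-29000z8xz3lf5nj*****     # your file system ID
  path: "/prod"                             # optional subdirectory, placeholder
# Retain keeps the backend fileset when the PVC is deleted;
# you clean up the data manually, so accidental PVC deletion is recoverable.
reclaimPolicy: Retain
```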
2. Apply the StorageClass:

   ```shell
   kubectl apply -f sc.yaml
   ```

   The expected output is:

   ```
   storageclass.storage.k8s.io/alicloud-bmcpfs-test created
   ```
Step 3: Create a PVC
Applications request storage through a persistent volume claim (PVC), which references the StorageClass as a provisioning template.
1. Create pvc.yaml with the following content:

   | Parameter | Description |
   | --- | --- |
   | `accessModes` | Only `ReadWriteMany` is supported, allowing multiple pods to mount and read/write simultaneously. |
   | `storage` | Requested storage capacity. Supports units such as Gi and Ti. |
   | `volumeMode` | Only `Filesystem` is supported. |
   | `storageClassName` | The StorageClass to use. Specifying this field triggers dynamic volume creation. |

   ```yaml
   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: bmcpfs-vsc
     namespace: default
   spec:
     accessModes:
       - ReadWriteMany            # Supports concurrent reads and writes across multiple pods
     resources:
       requests:
         storage: 10Ti            # Supports large-capacity storage (Ti level)
     volumeMode: Filesystem       # Only Filesystem is supported
     storageClassName: alicloud-bmcpfs-test  # Must match the StorageClass created in Step 2
   ```
2. Apply the PVC:

   ```shell
   kubectl apply -f pvc.yaml
   ```

3. Verify the PVC is bound:

   ```shell
   kubectl get pvc bmcpfs-vsc -n default
   ```

   The expected output is:

   ```
   NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS           AGE
   bmcpfs-vsc   Bound    pvc-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx   10Ti       RWX            alicloud-bmcpfs-test   30s
   ```

   When STATUS is Bound, the system has automatically created the corresponding PV. To confirm provisioning succeeded, run:

   ```shell
   kubectl describe pvc bmcpfs-vsc -n default
   ```

   In the Events section, look for a Provisioning succeeded message.
Step 4: Deploy a workload and mount the PVC
After the PVC is bound, deploy a workload that mounts the volume.
1. Create deploy.yaml with the following content:

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: cpfs-shared-example
   spec:
     replicas: 3  # Three replicas verify that shared storage works across multiple pods
     selector:
       matchLabels:
         app: cpfs-shared-app
     template:
       metadata:
         labels:
           app: cpfs-shared-app
       spec:
         tolerations:
           - key: node-role.alibabacloud.com/lingjun
             operator: Exists
             effect: NoSchedule
         # Optional: to pin all pods to a specific node, uncomment and set the node name
         # nodeName: cn-hangzhou.10.XX.XX.226
         containers:
           - name: app-container
             image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
             command:
               - /bin/sh
               - -c
               - >
                 echo "Data written by $(hostname)" > /data/$(hostname).txt &&
                 echo "Deployment is running, check shared data in /data." &&
                 sleep 3600
             volumeMounts:
               - name: pvc-cpfs
                 mountPath: /data  # Shared volume mounted at /data inside the container
         volumes:
           - name: pvc-cpfs
             persistentVolumeClaim:
               claimName: bmcpfs-vsc  # References the PVC created in Step 3
   ```
2. Apply the Deployment:

   ```shell
   kubectl apply -f deploy.yaml
   ```

   The expected output is:

   ```
   deployment.apps/cpfs-shared-example created
   ```
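The write pattern the Deployment uses can be simulated locally without a cluster: each replica writes one file named after its hostname into the shared directory, so three replicas produce three distinct files. The hostnames `pod-a`, `pod-b`, and `pod-c` below are stand-ins for the real replica names.

```shell
# Cluster-free simulation of the per-replica write pattern: with three
# replicas, /data (simulated here by a temp directory) ends up holding
# one file per pod hostname.
SHARED=$(mktemp -d)
for host in pod-a pod-b pod-c; do       # stand-ins for replica hostnames
  echo "Data written by $host" > "$SHARED/$host.txt"
done
FILE_COUNT=$(ls "$SHARED" | wc -l)
echo "files in shared dir: $FILE_COUNT"
```

On the real cluster, the equivalent check is to list /data from inside any one replica and confirm files from all three hostnames are visible.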
Clean up resources
To avoid unexpected costs and ensure data safety, delete resources in the following order.
1. Delete workloads: Stop all applications using the relevant PVCs. This unmounts the volumes.

   ```shell
   kubectl delete deployment <your-deployment-name>
   ```

2. Delete PVCs: The outcome depends on the reclaimPolicy of the StorageClass:

   - Retain (recommended): After deleting the PVC, the CPFS for Lingjun fileset and its data remain intact. Proceed to the next step to clean up the PV.
   - Delete: Deleting the PVC permanently deletes its bound PV and the backend fileset. This operation is irreversible.

   ```shell
   kubectl delete pvc <your-pvc-name>
   ```

3. Delete PVs (only when reclaimPolicy is Retain): After you delete the PVC, the PV transitions to Released status. Delete the PV to remove the resource definition from Kubernetes. This does not affect backend data.

   ```shell
   kubectl delete pv <your-pv-name>
   ```

4. (Optional) Delete the StorageClass: If you no longer need this storage configuration, delete the StorageClass. This does not affect already-created volumes.

   ```shell
   kubectl delete sc <your-sc-name>
   ```

5. Delete the CPFS for Lingjun file system: This permanently deletes all data on the file system, including data retained by the Retain policy. Confirm the file system has no remaining dependencies before proceeding. See Delete a file system.
Troubleshooting
PVC stays in Pending status
A PVC stuck in Pending means dynamic provisioning failed. Start with the PVC events — they usually identify the cause directly.
```shell
kubectl describe pvc <your-pvc-name> -n <your-namespace>
```
Check the Events section for warning messages. Common causes:
- StorageClass not found: The storageClassName field is incorrect, or the StorageClass does not exist.
- provisioning failed or failed to create fileset: There is an issue interacting with the backend storage. Continue with the steps below.
If the events point to a configuration issue, inspect the StorageClass and verify the CSI driver is registered:
```shell
# Check the StorageClass configuration
kubectl get storageclass <your-sc-name> -o yaml

# Verify the CSI driver is registered
kubectl get csidriver bmcpfsplugin.csi.alibabacloud.com
```
Confirm that:
- The provisioner field matches bmcpfsplugin.csi.alibabacloud.com.
- The bmcpfsId parameter is correctly set and the file system ID exists.
- If get csidriver returns no output or an error, the driver is not installed. Install bmcpfs-csi-controller, bmcpfs-csi-node, and cnfs-nas-daemon from the cluster's Add-ons page.
Pod stays in ContainerCreating or MountVolume.Setup failed
This error means the pod was scheduled to a node but the volume mount failed. Follow these steps to isolate the cause.
1. Check pod events for the specific failure:

   ```shell
   kubectl describe pod <pod-name> -n <your-namespace>
   ```

   In the Events section, look for Warning messages such as FailedMount or MountVolume.Setup failed.

2. Confirm the PVC is bound. Pods can only mount successfully bound volumes.

   ```shell
   kubectl get pvc <your-pvc-name>
   ```

   The STATUS must be Bound. If it is Pending, see PVC stays in Pending status.

3. If the PVC is bound, check the node-side CSI plugin logs for the lowest-level error:

   ```shell
   kubectl get pods -n kube-system -l app.kubernetes.io/name=bmcpfs-csi-driver \
     --field-selector spec.nodeName=<nodeName> \
     -o name | xargs kubectl logs -n kube-system -c csi-plugin
   ```

   These logs contain detailed error messages, including network connectivity issues, mount target permission errors, and underlying I/O errors.