
Container Service for Kubernetes: Prefill OSS data on demand to high-performance volumes

Last Updated: Mar 30, 2026

Before starting tasks such as AI training or data analytics, prefetch large volumes of cold data stored in Object Storage Service (OSS) on demand into high-performance storage volumes—such as CPFS for Lingjun or cloud disks. Compute tasks can then read data directly from these high-performance volumes at high speed. After the task completes, the storage volume is automatically reclaimed, balancing compute acceleration with cost optimization.

How it works

Feature implementation

This feature uses Kubernetes Volume Populators and is managed by ACK's storage-operator. When you create a persistent volume claim (PVC) that references the custom resource OSSVolumePopulator (OSSVP), storage-operator intercepts the request and performs the data population operation.

Depending on the target volume type, population occurs in one of the following modes.

  • bmcpfs-dataflow (native data flow mode)

    Supported storage types: CPFS for Lingjun.

    storage-operator leverages CPFS's native data flow capability to populate data. This mode does not consume cluster compute resources and offers higher efficiency.

  • generic (generic pod population mode)

    Supported storage types: other storage types, such as cloud disks or CPFS General-purpose Edition.

    storage-operator creates a temporary pod in the ack-volume-populator namespace to mount the newly created volume and download the specified OSS data into it. After population completes, the pod is destroyed. This mode consumes cluster compute resources.

After successful data population, the PVC status changes from Pending to Bound. At this point, application pods can mount the PVC and access the prefetched data.
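
For example, you can observe this transition with a standard kubectl command; the PVC name and namespace are placeholders for your own values:

# Watch the PVC until its status changes from Pending to Bound
kubectl get pvc <pvc-name> -n <namespace> -w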

Typical scenarios

This feature supports two primary use cases.

Scenario 1: Prefetch data to CPFS for Lingjun shared volumes

  • Applicable scenario: High-throughput, read-intensive workloads such as AI model training and inference, designed to overcome OSS access performance bottlenecks.

  • Technical implementation: Use CPFS for Lingjun with the bmcpfs-dataflow mode for native data population.

  • Key characteristics:

    • Cost optimization: Leverages native capabilities for acceleration without consuming compute resources. Storage is released immediately after use.

    • Data sharing: Supports many-to-one access. Multiple GPU tasks can share the same prefetched dataset.

Scenario 2: Prefetch data to isolated cloud disk volumes

  • Applicable scenario: Parallel batch processing or data pipeline tasks requiring isolated, read-write workspaces, designed to resolve concurrency conflicts and ensure data isolation.

  • Technical implementation: Use any dynamically provisioned storage, such as cloud disks, with the generic mode, which uses a temporary pod for generic data population.

  • Key characteristics:

    • Data isolation and flexibility: Each task gets its own dedicated storage volume, avoiding interference. Compatible with multiple dynamic storage types.

    • Elastic scaling: Supports one-to-one access. Storage lifecycle is tied to the task lifecycle, enabling precise cost control.

Workflow

The core steps for using VolumePopulator are similar across both modes.


  1. Environment preparation: Enable the VolumePopulator Feature Gate in storage-operator.

    After you enable VolumePopulator, the ack-volume-populator namespace is created by default to host the temporary PVCs and pods generated during data prefill.
  2. Permission configuration: Grant OSS access permissions for the population task. The bmcpfs-dataflow mode requires specific tags on the OSS bucket. The generic mode requires RRSA or AccessKey configuration.

  3. Define data source (OSSVP): Create an OSSVolumePopulator object to specify the OSS bucket path and population mode.

  4. Create volume and trigger prefill (PVC): Create a PVC, specify a StorageClass, and use the dataSourceRef field to reference the previously created OSSVP. This starts the data prefill process (timing depends on the StorageClass volumeBindingMode).

  5. Validation and usage: After the PVC status becomes Bound, create application workloads (pods, deployments, etc.) that mount this PVC for high-speed data access.

    Data prefilling is a one-time operation during volume creation. Any subsequent changes to the OSS source data will not sync to the created volume.
  6. Resource reclamation: After the task ends, delete the PVC. If the StorageClass reclaimPolicy is set to Delete, the associated high-performance storage resources (such as CPFS FileSets or cloud disks) are automatically deleted and billing stops. The OSS source data remains unaffected.

Preparations

  • Ensure your cluster runs Kubernetes 1.26 or later and uses the CSI plugin. This feature only supports dynamically provisioned volumes.

    To upgrade your cluster, see Manually upgrade a cluster. To migrate from FlexVolume to CSI, see Migrate FlexVolume to CSI using csi-compatible-controller.
  • Upgrade storage-operator to v1.35.1 or later and enable the VolumePopulator Feature Gate.

    If other feature gates are already enabled, use the format: xxxxxx=true,yyyyyy=false,VolumePopulator=true
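
    To confirm that the installed storage-operator version meets the requirement, you can inspect the component image tag. The following sketch assumes the component runs as a Deployment named storage-operator in the kube-system namespace, which may differ in your cluster:

    # Print the image (and therefore the version tag) of the storage-operator Deployment
    kubectl -n kube-system get deployment storage-operator \
      -o jsonpath='{.spec.template.spec.containers[0].image}'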

Scenario 1: Prefetch data to CPFS for Lingjun shared volumes

This solution targets read-only, high-throughput scenarios like model training and inference. It leverages CPFS for Lingjun's data flow capability to prefetch models from OSS into CPFS for Lingjun volumes on demand, enabling multiple GPU tasks to read data at high speed.

Preparations

1. Set a specific tag on the OSS bucket

Follow the instructions in Object tagging operations to add a tag to your OSS bucket with key cpfs-dataflow and value true.
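
If you prefer the command line, one possible way to add the tag is with ossutil. This is a sketch that assumes ossutil 1.x syntax; replace <your-bucket-name> with your bucket name:

# Add the tag cpfs-dataflow=true to the bucket (key and value separated by #)
ossutil bucket-tagging --method put oss://<your-bucket-name> cpfs-dataflow#true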

Do not delete or modify this tag while the feature is in use; otherwise, volume creation may fail.

2. Create OSSVolumePopulator (OSSVP)

Create an OSSVolumePopulator resource in the same namespace as your application and PVC to define the OSS data source.
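
If the namespace used in this example does not exist yet, create it first:

kubectl create namespace bmcpfs-dataflow-demo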

apiVersion: storage.alibabacloud.com/v1beta1
kind: OSSVolumePopulator
metadata:
  name: qwen3-32b
  # Must be in the same namespace as the application and PVC
  namespace: bmcpfs-dataflow-demo 
spec:
  bucket: <your-bucket-name>
  region: cn-hangzhou
  endpoint: oss-cn-hangzhou-internal.aliyuncs.com
  path: /Qwen3-32B/
  # Dedicated for CPFS for Lingjun volumes, leveraging its data flow capability
  mode: bmcpfs-dataflow
  
  # Optional advanced configurations for bmcpfs-dataflow mode
  # This example uses the defaults; ignore these fields unless you have special requirements.
  # bmcpfsDataflow:
    # Maximum data flow throughput (MB/s). Options: 600, 1200, 1500. Default: 600
    # throughput: 1200
    # Enable encrypted transfer. Default: empty (disabled)
    # sourceSecurityType: SSL
    # Data prefill mode. Default: metadataAndData for full prefill.
    # Set to metadata for metadata-only prefill.
    # dataType: metadataAndData
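
After saving the manifest (for example as ossvp.yaml, a placeholder file name), create the resource and confirm that it exists. ossvp is the short name used elsewhere in this topic:

kubectl apply -f ossvp.yaml
kubectl -n bmcpfs-dataflow-demo get ossvp qwen3-32b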

Parameter descriptions:

  • namespace: The OSSVolumePopulator must reside in the same namespace as the application and PVC. Required.

  • bucket: The OSS bucket name. Required.

  • region: The OSS region. Required.

  • endpoint: The OSS service endpoint address. Required.

  • path: Path prefix within the OSS bucket, such as /data/. Optional. Default: /.

  • mode: Operation mode. Optional. Default: auto. Options:

    • generic: Generic population mode for any backend dynamic volume (for example, cloud disks or CPFS General-purpose Edition). Creates a temporary pod in the ack-volume-populator namespace to download data, consuming cluster compute resources.

    • bmcpfs-dataflow: High-performance mode exclusively for CPFS for Lingjun volumes. Uses native data flow for higher efficiency without consuming cluster compute resources.

    • auto: Automatically selects a mode based on the target storage type, but usually falls back to generic mode. Manually specify bmcpfs-dataflow if needed.

  • bmcpfsDataflow.throughput: Maximum CPFS for Lingjun data flow throughput in MB/s. Options: 600, 1200, 1500. Optional. Default: the CPFS for Lingjun data flow default.

  • bmcpfsDataflow.sourceSecurityType: Data transfer security protocol type, such as SSL to enable encrypted transfer. Optional. Default: empty (encryption disabled).

  • bmcpfsDataflow.dataType: Data type to sync. Optional. Default: metadataAndData.

    • metadata: Sync file metadata only. Actual data is pulled from OSS on first read, which may affect initial access performance.

    • metadataAndData: Sync both metadata and file content.

3. Prepare StorageClass and PVC

Create a StorageClass that references CPFS for Lingjun, then create a PVC and reference the previously created OSSVP using dataSourceRef.

Create StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: bmcpfs-dataflow-demo
parameters:
  bmcpfsId: bmcpfs-29000z8xz3xxxxxxxxxxx
  vpcMountTarget: cpfs-29000z8xz3xxxxxxxxxxx-vpc-xxxxxx.cn-wulanchabu.cpfs.aliyuncs.com
  mountpointAutoSwitch: "true"
provisioner: bmcpfsplugin.csi.alibabacloud.com
# Critical: Ensure CPFS FileSet and data are automatically cleaned up after PVC deletion
reclaimPolicy: Delete
# Recommended: Set to Immediate to start data prefill immediately after PVC creation
volumeBindingMode: Immediate

Each dynamic volume created through this StorageClass automatically creates a Fileset on the backend CPFS for Lingjun. By creating different OSSVP resources, you can prefill different OSS datasets into different dynamic volumes using the same StorageClass.

Create PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-32b
  namespace: bmcpfs-dataflow-demo
spec:
  accessModes:
    - ReadOnlyMany
  dataSourceRef:
    apiGroup: storage.alibabacloud.com
    kind: OSSVolumePopulator
    # Reference the OSSVP name created above
    name: qwen3-32b
  resources:
    requests:
      # Must be at least the size of the OSS data source referenced by OSSVP
      storage: 80Gi
  storageClassName: bmcpfs-dataflow-demo
  volumeMode: Filesystem

Because the StorageClass sets volumeBindingMode: Immediate, storage-operator immediately executes the data flow task from OSS to CPFS for Lingjun after PVC creation.

4. Verify data prefill status

Check data flow progress: During population, the PVC status remains Pending and changes to Bound after completion.

For bmcpfs-dataflow mode, you can also check real-time CPFS for Lingjun data flow progress using the following command.

kubectl -n bmcpfs-dataflow-demo describe ossvp qwen3-32b
  • status during population:

      Bmcpfs Dataflow:
        62a4e7ec-fae1-4f11-848f-b57cxxxxxxxx:
          Data Flow Id:       df-29d3ad9e9xxxxxxx
          Data Flow Task Id:  task-2993179xxxxxxxxx
          File Set Id:        fset-2997498xxxxxxxxx
          File System Id:     bmcpfs-29000z8xz3lf5xxxxxxxx
          Progress:           59%
  • status after completion:

    Message:  Populated successfully

5. Create workload and use data

After the PVC becomes Bound, create a workload that mounts this PVC.

This example uses GPU resources. If you only want to verify the data, you can instead create a CPU-only pod and use kubectl exec to log in to the container and inspect the data.


apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-apply-qwen3-32b
  namespace: bmcpfs-dataflow-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-apply-qwen3-32b
  template:
    metadata:
      labels:
        app: demo-apply-qwen3-32b
      # Use ACS HPN instance type to mount CPFS volume
      # alibabacloud.com/compute-class: "gpu-hpn"
    spec:
      # Use ACS HPN instance type to mount CPFS volume
      # nodeSelector: 
      #   alibabacloud.com/node-type: reserved
      # tolerations:
      # - key: "virtual-kubelet.io/provider" 
      #   operator: "Exists"
      #   effect: "NoSchedule"
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            # Specify the created PVC
            claimName: qwen3-32b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
      containers:
        - command: ["sh", "-c", "python3 -m sglang.launch_server --model-path /models/Qwen3-32B --tp 2"]
          image: registry.cn-beijing.aliyuncs.com/tool-sys/ossfs-public:demo-env-python3.12.7-sglang0.5.5
          name: sglang
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: "2"
            requests:
              nvidia.com/gpu: "2"
          volumeMounts:
            - mountPath: /models/Qwen3-32B
              name: model-storage
            - mountPath: /dev/shm
              name: dshm

Resource cleanup guide

After completing AI training or inference tasks, promptly release the CPFS for Lingjun shared volumes and related workloads created for them.

Resources to release:

  • Workloads using the shared volume (in this example, a StatefulSet)

  • PVC used for data prefilling

  • Backend storage resources (CPFS for Lingjun FileSet) automatically created by the PVC

Cleanup procedure:

  1. Delete the workload

    Delete the StatefulSet that uses the volume so that the PVC is no longer referenced.

    kubectl delete statefulset demo-apply-qwen3-32b -n bmcpfs-dataflow-demo
  2. Delete the PVC

    Because the StorageClass sets reclaimPolicy: Delete, this action automatically triggers deletion of the backend CPFS FileSet, releasing storage space and stopping billing.

    kubectl delete pvc qwen3-32b -n bmcpfs-dataflow-demo
  3. Verify resource cleanup:

    • Verify CPFS for Lingjun file system: Go to the NAS console, select File System > File System List, and confirm the FileSet associated with this PVC has been deleted and the file system's used capacity has decreased.

    • Verify OSS source data: This operation does not affect OSS source data. To verify, go to the OSS console and confirm the dataset remains intact.
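
    You can also confirm from within the cluster that the PVC is gone:

    # Expected to return a NotFound error after successful cleanup
    kubectl -n bmcpfs-dataflow-demo get pvc qwen3-32b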

Scenario 2: Prefetch data to isolated cloud disk volumes

This solution applies to batch processing workflows. Using Argo Workflows, it dynamically creates and preheats an independent cloud disk for each task, achieving data isolation and elasticity.

Preparations

  • Enable VolumePopulatorPodHandler in storage-operator.

    After enabling, the system automatically grants necessary RBAC permissions to related components and temporary pods. Evaluate potential security risks before enabling.

    Automatically configured RBAC permissions

    • Grant storage-operator pod operation permissions in the ack-volume-populator namespace

      - apiGroups: [""]
        resources: [pods]
        verbs: [get, list, watch, create, delete]
    • Grant cluster-level access permissions to temporary task pods

      - apiGroups: [""]
        resources: [events]
        verbs: [create, patch, get, list]
      - apiGroups: [volumepopulators.storage.alibabacloud.com]
        resources: [ossvolumepopulators]
        verbs: [get, list, watch]
  • Argo Workflows is installed.

    Detailed steps

    1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

    2. On the Clusters page, click the name of your cluster. In the left-side navigation pane, choose Operations > Add-ons.

    3. On the Add-ons page, find Argo Workflows and install it as prompted.

  • This example uses serverless compute (ECI) to run data prefilling tasks and workflows, so you must also install the ack-virtual-node component. If you use non-serverless compute for verification, remove the related label alibabacloud.com/eci: "true" from resources.

1. Authorize data prefilling tasks to access OSS

In generic mode, data prefilling tasks run as temporary pods in the ack-volume-populator namespace. You must grant these pods permission to access the OSS bucket containing the source data.

  • RRSA method: Dynamically assign temporary, auto-rotating RAM roles to pods for fine-grained, application-level permission isolation with higher security.

  • AccessKey method: Store static, long-term credentials in a Secret. Simple to configure but less secure.

RRSA method

1. Enable RRSA in your cluster

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the target cluster and click its name. In the left-side pane, click Cluster Information.

  3. On the Basic Information tab, find the Security and Auditing section. To the right of RRSA OIDC, click Enable. Follow the on-screen prompts to enable RRSA during off-peak hours.

When the cluster status changes from Updating to Running, RRSA is enabled.

Important

After you enable RRSA, the maximum validity period for new ServiceAccount tokens created in the cluster is limited to 12 hours.

2. Create a RAM role and grant permissions

Create a RAM role for pods to assume, enabling OSS access through RRSA authentication.

View steps

  1. Create a RAM role.

    1. Go to the RAM console - Create Role page, select Principal Type as Identity Provider, then click Switch to Policy Editor to enter the Visual Editor page.

    2. Select Principal as Identity Provider, click Edit, and follow the instructions below to complete configuration.

      Main configurations are as follows. Keep other parameters at default. See Create a RAM role for an OIDC identity provider for details.

      • Identity Provider Type: OIDC.

      • Identity Provider: Select ack-rrsa-<cluster_id>. Replace <cluster_id> with your cluster ID.

      • Condition: Manually add oidc:sub.

        • Key: Select oidc:sub.

        • Operator: Select StringEquals.

        • Value: Enter system:serviceaccount:ack-volume-populator:plugin-account.

      • Role Name: This example uses demo-role-for-rrsa.

  2. Create a permission policy.

    Following the principle of least privilege, create a custom policy granting access to the target OSS bucket (OSS read-only or read-write permissions).

    Go to the RAM console - Create Policy page, switch to JSON Editor, and configure the policy script as instructed.

    If you already have a RAM role with OSS permissions, modify its trust policy for reuse. See Use an existing RAM role and grant permissions.

    OSS read-only policy

    Replace <myBucketName> with your actual bucket name.
    {
        "Statement": [
            {
                "Action": [
                    "oss:Get*",
                    "oss:List*"
                ],
                "Effect": "Allow",
                "Resource": [
                    "acs:oss:*:*:<myBucketName>",
                    "acs:oss:*:*:<myBucketName>/*"
                ]
            }
        ],
        "Version": "1"
    }
  3. Attach the policy to the RAM role.

    On the Roles page of the RAM console, find the role you created (demo-role-for-rrsa in this example) and attach the custom policy to it.
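
    For reference, after the identity provider configuration in step 1, the role's trust policy typically looks similar to the following sketch. Replace <account-id> and <cluster_id> with your own values; the exact format may vary by console version:

    {
        "Statement": [
            {
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {
                        "oidc:sub": "system:serviceaccount:ack-volume-populator:plugin-account"
                    }
                },
                "Effect": "Allow",
                "Principal": {
                    "Federated": [
                        "acs:ram::<account-id>:oidc-provider/ack-rrsa-<cluster_id>"
                    ]
                }
            }
        ],
        "Version": "1"
    }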

AccessKey method

  1. Create a RAM user (skip this step if you already have one). Go to the Create User page in the RAM console. Follow the on-screen instructions to create a RAM user. Set a logon name and password.

  2. Create a permission policy.

    Following the principle of least privilege, create a custom policy granting access to the target OSS bucket (OSS read-only or read-write permissions).

    Go to the RAM console - Create Policy page, switch to JSON Editor, and configure the policy script as instructed.

    OSS read-only policy

    Replace <myBucketName> with your actual bucket name.
    {
        "Statement": [
            {
                "Action": [
                    "oss:Get*",
                    "oss:List*"
                ],
                "Effect": "Allow",
                "Resource": [
                    "acs:oss:*:*:<myBucketName>",
                    "acs:oss:*:*:<myBucketName>/*"
                ]
            }
        ],
        "Version": "1"
    }
  3. Grant the policy to the RAM user.

    1. Go to the Users page in the RAM console. In the Actions column for the target user, click Attach Policy.

    2. In the Access Policy section, search for and select the policy you created, and add the permissions.

  4. Create an AccessKey for the RAM user to store as a Secret for data prefilling.

    1. Go to the RAM console - Users page, click the target user in the RAM user list, then in the AccessKey tab, click Create AccessKey.

    2. Follow the instructions to create the AccessKey in the dialog box and securely store the AccessKey ID and AccessKey Secret.

  5. Create a Secret in the cluster.

    Use the following YAML to create a Secret in the ack-volume-populator namespace to store the AccessKey.

    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
      # The namespace must be set to ack-volume-populator
      namespace: ack-volume-populator
    stringData:
      # Replace with the previously obtained AccessKey ID
      accessKeyId: <your-AccessKey-ID>
      # Replace with the previously obtained AccessKey Secret
      accessKeySecret: <your-AccessKey-Secret>

2. Create OSSVolumePopulator (OSSVP)

Create an OSSVolumePopulator resource in the same namespace as your application and PVC to define the data source and specify mode as generic along with the authorization method.

apiVersion: storage.alibabacloud.com/v1beta1
kind: OSSVolumePopulator
metadata:
  name: generic-demo
  # Must be in the same namespace as the application and PVC
  namespace: argo 
spec:
  bucket: my-test-bucket
  region: cn-hangzhou
  endpoint: oss-cn-hangzhou-internal.aliyuncs.com
  path: /many-files/
  # Generic mode for any backend storage volume
  mode: generic
  generic:
    # Add labels to data population task pods for scheduling to ECI pods
    labels:
      alibabacloud.com/eci: "true"
    # Add annotations to data population task pods for configuring ECI specs
    annotations:
      k8s.aliyun.com/eci-use-specs: "2-4Gi"
    # Choose either secretRef or rrsaConfigs
    # secretRef: oss-secret
    rrsaConfigs:
      # ARN of the RAM role used for RRSA authorization
      roleArn: "acs:ram::1234567*****:role/oss-populator"
      # ARN of the cluster's OIDC Provider
      oidcProviderArn: "acs:ram::1234567*****:oidc-provider/my-oidc-provider"
    # Configure affinity for data population task pods to schedule to specific nodes
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "disktype"
                  operator: NotIn
                  values:
                    - "hdd"
    # Configure tolerations for data population task pods
    tolerations:
      - key: "virtual-kubelet.io/provider"
        operator: Equal
        value: "alibabacloud"
        effect: NoSchedule
    # Maximum throughput (MB/s)
    # throughput: 1000

Parameter descriptions:

  • namespace: The OSSVolumePopulator must reside in the same namespace as the application and PVC. Required.

  • bucket: The OSS bucket name. Required.

  • region: The OSS region. Required.

  • endpoint: The OSS service endpoint address. Required.

  • path: Path prefix within the OSS bucket, such as /data/. Optional. Default: /.

  • mode: Operation mode. Optional. Default: auto. Options:

    • generic: Generic population mode for any backend dynamic volume (for example, cloud disks or CPFS General-purpose Edition). Creates a temporary pod in the ack-volume-populator namespace to download data, consuming cluster compute resources.

    • bmcpfs-dataflow: High-performance mode exclusively for CPFS for Lingjun volumes. Uses native data flow for higher efficiency without consuming cluster compute resources.

    • auto: Automatically selects a mode based on the target storage type, but usually falls back to generic mode. Manually specify bmcpfs-dataflow if needed.

  • generic.labels: Labels to add to data population task pods, for example to schedule them to ECI. Optional.

  • generic.annotations: Annotations to add to data population task pods, for example to configure ECI specifications. Optional.

  • generic.secretRef: Name of the Secret storing the AccessKey. Choose either this or generic.rrsaConfigs. Optional.

  • generic.rrsaConfigs.roleArn: ARN of the RAM role used for RRSA authorization. Go to the RAM console Roles page, click the RAM role name, and get the ARN from the details page. Required when using RRSA.

  • generic.rrsaConfigs.oidcProviderArn: ARN of the cluster's OIDC provider. On the ACK Clusters page, click the target cluster name, choose Cluster Information, and get the ARN from the Basic Information tab under RRSA OIDC. Required when using RRSA.

  • generic.affinity: Affinity rules for data population task pods so that they are scheduled to specific nodes. Optional.

  • generic.tolerations: Tolerations for data population task pods. Optional.

  • generic.throughput: Maximum throughput in MB/s, that is, the maximum download speed of data population task pods. Unlike bmcpfs-dataflow, generic-mode performance is constrained by node resources (network, CPU) and storage write capability, so this setting is an upper limit primarily to prevent prefill tasks from consuming excessive cluster resources. Optional. Default: unlimited (the actual rate depends on node network, CPU, and storage write performance).

3. Prepare StorageClass

This scenario requires a StorageClass for dynamically provisioning cloud disks. ACK provides a default StorageClass and also supports manually creating a StorageClass.

  • This example uses alicloud-disk-essd, which has reclaimPolicy set to Delete and volumeBindingMode set to Immediate, suitable for scenarios using serverless compute where zone awareness isn't required.

  • If you run workflows on non-serverless compute, use a StorageClass with volumeBindingMode set to WaitForFirstConsumer (such as alicloud-disk-topology-alltype) to ensure cloud disks and application pods are created in the same zone, avoiding scheduling failures due to zone mismatches.
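
You can confirm that these StorageClasses exist in the cluster before continuing:

kubectl get storageclass alicloud-disk-essd alicloud-disk-topology-alltype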

4. Create Argo Workflow

The following Workflow example uses an ephemeral volumeClaimTemplate to dynamically create an independent cloud disk with prefilled initial data for each parallel task.


apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-data-process-with-ossvp-
  namespace: argo
spec:
  # Define input parameters
  arguments:
    parameters:
      - name: number
        value: 2

  entrypoint: main
  # Define volume claim template for each concurrent replica in the Workflow
  volumes:
    - name: scratch-volume
      # Declare as ephemeral volume, automatically deleted after Workflow completion
      ephemeral:
        volumeClaimTemplate:
          metadata:
            labels:
              diskType: scratch-volume
          spec:
            accessModes: [ "ReadWriteOnce" ]
            # For non-serverless compute, replace with alicloud-disk-topology-alltype
            storageClassName: "alicloud-disk-essd"
            resources:
              requests:
                # Must be at least the size of the OSS data source referenced by OSSVP
                storage: 20Gi
            # Reference OSSVP as data source
            dataSourceRef:
              apiGroup: storage.alibabacloud.com
              kind: OSSVolumePopulator
              name: generic-demo

  templates:
    - name: main
      dag:
        tasks:
          # Execute multiple echo tasks in parallel
          - name: echo-task
            template: echo-template
            arguments:
              parameters:
                - name: index
                  value: "{{item}}"
            withSequence:
              count: "{{workflow.parameters.number}}"

    - name: echo-template
      metadata:
        labels:
          # Run tasks on ECI
          alibabacloud.com/eci: "true" 
      container:
        image: mirrors-ssl.aliyuncs.com/busybox:latest
        command:
          - sh
          - -c
        args:
          - |
            echo "Subtask started, ID: {{inputs.parameters.index}}"
            echo "Creating a new log file..."
            touch /scratch-volume/"{{inputs.parameters.index}}-logs"
            echo "Listing contents from the disk populated by OSSVP:"
            ls /scratch-volume
            echo "Subtask completed, ID: {{inputs.parameters.index}}"
        volumeMounts:
        - name: scratch-volume
          mountPath: /scratch-volume
        resources:
          limits:
            cpu: '4'
            memory: 16Gi
          requests:
            cpu: '4'
            memory: 16Gi
      inputs:
        parameters:
          - name: index

After creating the Workflow, check logs from any task pod to confirm successful reading of prefilled data.

In production environments, you can also use the Argo Workflows Artifact feature to persist final computation results in OSS.

# Replace <your-workflow-pod-name> with the actual pod name
kubectl -n argo logs <your-workflow-pod-name>

Expected output:

Subtask started, ID: 1
Creating a new log file...
Listing contents from the disk populated by OSSVP:
1-logs
lost+found
results-2025-04-16T07:48:00Z
...
Subtask completed, ID: 1

Result analysis:

  • 1-logs: File newly written by the task, verifying the volume is read-write capable and storage is isolated between parallel tasks.

  • results-2025-04-16T07:48:00Z and similar files: Data prefilled from OSS to the cloud disk, confirming the prefilling feature works correctly.

  • lost+found: Directory automatically generated after file system formatting. Ignore it.

Resource cleanup guide

After completing batch processing tasks, promptly release the isolated cloud disk volumes and related workflows dynamically created by Argo Workflow.

Resources to release:

  • Argo Workflow instance

  • Temporary PVCs automatically created by the Workflow

  • Backend storage resources (cloud disks) automatically created by the PVCs.

Cleanup procedure:

  1. Delete the Argo Workflow:

    In this scenario, you typically only need to delete the Workflow resource. Because the workflow uses ephemeral volume claims, deleting the Workflow cascades deletion of all PVCs it created. The StorageClass sets reclaimPolicy: Delete, which in turn triggers automatic deletion of the backend cloud disks, releasing resources and stopping billing.

    # Replace <workflow-name> with the actual Workflow name
    kubectl -n argo delete workflow <workflow-name>
  2. Verify resource cleanup:

    • Verify PVCs: Run kubectl -n argo get pvc to confirm all PVCs related to this workflow have been deleted.

    • Verify cloud disk resources: Go to ECS console - Block Storage - Disks and confirm no cloud disk resources remain from this workflow.

    • Verify OSS source data: This operation does not affect OSS source data. To verify, go to the OSS console and confirm the dataset remains intact.

Production environment recommendations

  • Cost and resource management:

    • Set automatic resource reclamation: Configure reclaimPolicy: Delete for the StorageClass used to dynamically create volumes. This ensures high-performance storage resources are automatically cleaned up after task completion.

    • Optimize data population costs (generic mode): In generic mode, data population consumes compute resources. By configuring affinity and tolerations in OSSVolumePopulator, you can schedule temporary task pods to lower-cost serverless compute (including ACS, ECI) or spot instances.

    • Plan storage capacity: The storage capacity requested when creating a PVC must exceed the source data size. Otherwise, data population fails due to insufficient space.

  • Performance and stability:

    • Cloud disk zone alignment: Cloud disks are zone-scoped resources. When using ECS nodes, set the StorageClass volumeBindingMode to WaitForFirstConsumer to ensure cloud disks and application pods are always created in the same zone, avoiding mount failures from cross-zone scheduling.

    • Balance CPFS for Lingjun prefill modes: If you prioritize quick Persistent Volume (PV) readiness and can tolerate some latency on first file reads, choose dataType: metadata. If your workload requires high first-read performance and full data prefilling, choose dataType: metadataAndData.

    • Status monitoring: Use kubectl describe ossvp <name> to monitor data prefill task status and events for quick troubleshooting.

  • Security and permissions:

    • Use RRSA for secure authorization: In generic mode, when granting OSS access permissions to data population task pods, prefer the RRSA method to avoid security risks from AccessKey leakage.

Billing information

This feature involves the following charges:

  • High-performance storage fees: Charged based on the created volume type (such as CPFS for Lingjun or cloud disks) and their lifecycle.

  • OSS storage fees: Storage fees for source data in OSS.

  • Data transfer fees: Configure the OSS internal endpoint in OSSVP to avoid traffic charges. Using the public endpoint incurs traffic fees.

  • Compute resource fees (only for generic mode): Temporary pods in generic mode consume cluster compute resources (CPU, memory, bandwidth) and are billed based on specifications and duration.

  • For CPFS for Lingjun in bmcpfs-dataflow mode, data flow tasks are currently free because the data flow feature is in public preview.

FAQ

After prefilling completes, if I update source files in OSS, will the data in the volume sync automatically?

No. Data prefilling is a one-time operation during volume creation. Once the volume is created and populated, its contents are decoupled from the OSS source data. Any subsequent changes to OSS will not sync to the created volume.

Why does my PVC stay in Pending status?

Pending is normal during data prefilling. If it remains Pending for an extended period, troubleshoot as follows.

  1. kubectl describe pvc <pvc-name> -n <namespace>: Check PVC events for populator-related error messages.

  2. kubectl describe ossvp <ossvp-name> -n <namespace>: Check OSSVP status and events for population task status, progress, or failure reasons.

  3. If using generic mode, check for failed pods in the ack-volume-populator namespace and review their logs. Common causes include insufficient OSS permissions, network issues, or insufficient storage space.