Container Service for Kubernetes: Best practice for OSS read/write splitting

Last Updated: Mar 26, 2026

OSS persistent volumes (PVs) support multiple clients, but full write support degrades read performance. Read/write splitting addresses this by routing reads and writes through separate mount paths — a read-optimized path for data loading and a write path for checkpoints or outputs. This significantly improves throughput in read-intensive workloads such as model training, inference, and data analytics.

This topic explains how to implement read/write splitting for OSS PVs using ossfs clients or the OSS SDK, with a worked example based on an MNIST handwriting recognition training job.

Prerequisites

Before you begin, ensure that you have:

  • A cluster with the Container Storage Interface (CSI) component installed. Different clients require different CSI versions — upgrade to the latest version. For more information, see Manage csi-plugin and csi-provisioner components.

  • An OSS bucket in the same Alibaba Cloud account as the cluster.

Important

Keep the OSS bucket in the same account as the cluster. Cross-account OSS access is not recommended.

Choose a client

OSS PVs support three clients: ossfs 1.0, ossfs 2.0, and strmvol. All three support read-only access. Their write support differs:

| Client | Read-only | Read/write | Best for |
| --- | --- | --- | --- |
| ossfs 1.0 | Yes | Full write | General read/write workloads; direct read mode available (v1.91+) |
| ossfs 2.0 | Yes | Sequential append writes only | Read-heavy workloads; requires CSI >= 1.33.1 |
| strmvol | Yes | Not supported | Many small files (datasets, time-series logs, quantitative backtesting) |

For detailed guidance on choosing a client, see Client selection reference.

Use cases

Read-only access

Set the PV access mode to ReadOnlyMany to prevent accidental data modification. This is appropriate for inference, data analytics, and log queries in big data services.

  • For read-heavy workloads, use ossfs 2.0 (requires CSI >= 1.33.1). See Use ossfs 2.0 PVs.

  • For workloads that read many small files, use strmvol. See Use strmvol PVs.

To tune ossfs 1.0 for read-only scenarios, configure these parameters in otherOpts. Most workloads do not need to change these defaults — adjust them only when you have a specific performance requirement.

| Parameter | Default | Description |
| --- | --- | --- |
| kernel_cache | Disabled | Enables the kernel buffer cache. Use when real-time access to the latest object content is not required; cache hits are served from memory that is not in use by other processes. |
| parallel_count | 20 | Number of concurrent shards for uploading or downloading large files. |
| max_multireq | 20 | Maximum concurrent requests for listing object metadata. Must be >= parallel_count. |
| max_stat_cache_size | 1000 | Number of cached metadata entries. Set to 0 to disable. Increase this value to speed up ls in directories with many files; 10,000 entries consume about 40 MB of memory. |
| direct_read | Disabled | Direct read mode for read-only scenarios (ossfs 1.91 and later). See Features and performance testing of the new ossfs 1.0 version and Performance optimization for read-only scenarios. |

Read/write access

Set the PV access mode to ReadWriteMany for workloads that write data.

Note

ossfs does not guarantee consistency for concurrent write operations. If multiple writers modify the same objects simultaneously, data corruption can occur. For sequential checkpoint writes during model training, use a single writer per path.

Warning

Deleting or modifying files in the mounted path also deletes or modifies the corresponding objects in the OSS bucket. To protect against accidental data loss, enable versioning for the bucket.

For read-intensive workloads where reads and writes go to different paths (such as model training), implement read/write splitting: mount the read path as ReadOnlyMany with cache parameters enabled, and handle writes separately using a ReadWriteMany PV or the OSS SDK.

How read/write splitting works

Read/write splitting routes reads and writes through separate mount points, each pointing to a different subpath of the same OSS bucket. Mounting different subpaths is the core design decision — it isolates read I/O and write I/O so they never share the same mount point or interfere with each other.

  • Read path — Mount a subpath (for example, /tf-train/train/data) as ReadOnlyMany with cache settings enabled. The kernel cache and metadata cache mean repeated reads are served from memory rather than from OSS.

  • Write path — Mount a different subpath (for example, /tf-train/training_logs) as ReadWriteMany, or write directly to OSS using the SDK. This isolates write traffic from the read path.
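When both paths are mounted as PVs in the same pod, you can confirm the split from inside the container: each path appears as an independent FUSE mount. The following is a quick sanity check, assuming the container image ships the standard mount utility; exact output varies by ossfs version:

    kubectl exec <pod-name> -- mount | grep -i ossfs
    # Expect one entry per mounted path: the read-only dataset mount and the writable checkpoint mount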

Example: MNIST handwriting recognition training

This example uses a handwriting image recognition and training job to demonstrate read/write splitting. The job:

  1. Reads the training dataset from /tf-train/train/data in the OSS bucket using a read-only PV.

  2. Writes training checkpoints to /tf-train/training_logs using either a read/write PV or the OSS SDK.

Before you start, download the MNIST dataset and upload it to the /tf-train/train/data directory in your OSS bucket.
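For example, you can upload the files with the ossutil command-line tool. This is a sketch: it assumes the MNIST files are in a local directory named mnist-data and that ossutil is already configured with credentials for the bucket.

    # Recursively upload the local dataset directory to the read path in the bucket
    ossutil cp -r ./mnist-data oss://<your-bucket-name>/tf-train/train/data/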

Implement read/write operations using ossfs

Because checkpoint writing is a sequential append operation, either ossfs 1.0 or ossfs 2.0 works for the write path.

  1. Deploy the training application. The application mounts the /tf-train subpath of the OSS bucket to the /mnt directory of the pod. For more information on configuring an OSS PV, see Use ossfs 1.0 statically provisioned volumes or Use ossfs 2.0 PVs.

    1. Create an ossfs 1.0 PV:

      cat << EOF | kubectl apply -f -
      apiVersion: v1
      kind: Secret
      metadata:
        name: oss-secret
        namespace: default
      stringData:
        akId: "<your-accesskey-id>"
        akSecret: "<your-accesskey-secret>"
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: tf-train-pv
        labels:
          alicloud-pvname: tf-train-pv
      spec:
        capacity:
          storage: 10Gi
        accessModes:
          - ReadWriteMany
        persistentVolumeReclaimPolicy: Retain
        csi:
          driver: ossplugin.csi.alibabacloud.com
          volumeHandle: tf-train-pv
          nodePublishSecretRef:
            name: oss-secret
            namespace: default
          volumeAttributes:
            bucket: "<your-bucket-name>"
            url: "oss-<region>.aliyuncs.com"
            otherOpts: "-o max_stat_cache_size=0 -o allow_other"
            path: "/tf-train"
      ---
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: tf-train-pvc
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 10Gi
        selector:
          matchLabels:
            alicloud-pvname: tf-train-pv
      EOF
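      Optionally confirm that the PVC binds to the PV before creating the pod:

      kubectl get pvc tf-train-pvc
      # STATUS should report Bound before you continue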
    2. Create the training pod:

      During training, intermediate files are written to /mnt/training_logs in the pod and uploaded by ossfs to /tf-train/training_logs in the OSS bucket.

      cat << EOF | kubectl apply -f -
      apiVersion: v1
      kind: Pod
      metadata:
        labels:
          app: tfjob
        name: tf-mnist
        namespace: default
      spec:
        containers:
        - command:
          - sh
          - -c
          - python /app/main.py
          env:
          - name: NVIDIA_VISIBLE_DEVICES
            value: void
          - name: gpus
            value: "0"
          - name: workers
            value: "1"
          - name: TEST_TMPDIR
            value: "/mnt"
          image: registry.cn-beijing.aliyuncs.com/tool-sys/tf-train-demo:rw
          imagePullPolicy: Always
          name: tensorflow
          ports:
          - containerPort: 20000
            name: tfjob-port
            protocol: TCP
          volumeMounts:
            - name: train
              mountPath: "/mnt"
          workingDir: /root
        priority: 0
        restartPolicy: Never
        securityContext: {}
        terminationGracePeriodSeconds: 30
        volumes:
        - name: train
          persistentVolumeClaim:
            claimName: tf-train-pvc
      EOF
  2. Verify that data is read and written correctly.

    1. Check the pod status:

      kubectl get pod tf-mnist

      Wait a few minutes for the pod status to change from Running to Completed:

      NAME       READY   STATUS      RESTARTS   AGE
      tf-mnist   0/1     Completed   0          2m12s
    2. Check the data load time in the pod logs:

      kubectl logs tf-mnist | grep dataload

      Expected output (actual time varies by instance performance and network):

      dataload cost time:  1.54191803932
    3. Log in to the OSS Management Console and verify that files have been generated under /tf-train/training_logs in the bucket. Files present in that directory confirm that read and write operations are working.
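      Alternatively, you can run the same check from the command line with ossutil (assuming it is configured with credentials for the bucket):

      ossutil ls oss://<your-bucket-name>/tf-train/training_logs/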

Optimize read performance using read/write splitting

Split the single read/write PV into two separate PVs: a read-only PV with cache tuning for the dataset, and a write PV for checkpoints. The training service code does not need to change — only the mount configuration differs.

Two write options are available:

  • Option 1: Use a separate read/write ossfs PV for checkpoint writes.

  • Option 2: Use the OSS SDK to write checkpoints directly, bypassing ossfs entirely.

Option 1: Write using a read/write ossfs PV

  1. Create a read-only ossfs 1.0 PV for the dataset. The key configuration changes:

    • Set accessModes to ReadOnlyMany on both the PV and PersistentVolumeClaim (PVC). Mount only the dataset subpath /tf-train/train/data.

    • In otherOpts, add -o kernel_cache -o max_stat_cache_size=10000 -o umask=022:

      • kernel_cache enables in-memory read caching.

      • max_stat_cache_size=10000 caches 10,000 metadata entries (approximately 40 MB). Adjust based on your instance type and dataset size.

      • umask=022 grants read access to container processes running as non-root users.

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
      namespace: default
    stringData:
      akId: "<your-accesskey-id>"
      akSecret: "<your-accesskey-secret>"
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: tf-train-pv
      labels:
        alicloud-pvname: tf-train-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: tf-train-pv
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: "<your-bucket-name>"
          url: "oss-<region>.aliyuncs.com"
          otherOpts: "-o kernel_cache -o max_stat_cache_size=10000 -o umask=022 -o allow_other"
          path: "/tf-train/train/data"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: tf-train-pvc
    spec:
      accessModes:
      - ReadOnlyMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: tf-train-pv
    EOF
  2. Create a read/write ossfs 1.0 PV for checkpoints. This PV mounts only the checkpoint subpath /tf-train/training_logs. Metadata caching is disabled (max_stat_cache_size=0) because checkpoint writes are sequential and do not benefit from caching.

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: tf-logging-pv
      labels:
        alicloud-pvname: tf-logging-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: tf-logging-pv
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: "<your-bucket-name>"
          url: "oss-<region>.aliyuncs.com"
          otherOpts: "-o max_stat_cache_size=0 -o allow_other"
          path: "/tf-train/training_logs"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: tf-logging-pvc
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: tf-logging-pv
    EOF
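    Optionally verify that both PVCs are bound before deploying the pod:

    kubectl get pvc tf-train-pvc tf-logging-pvc
    # Both should report STATUS Bound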
  3. Deploy the training pod with both PVs mounted.

    The training service code does not need modification. Mount both PVs in the pod spec — the read-only PV at /mnt/train/data and the read/write PV at /mnt/training_logs.
    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        app: tfjob
      name: tf-mnist
      namespace: default
    spec:
      containers:
      - command:
        - sh
        - -c
        - python /app/main.py
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: void
        - name: gpus
          value: "0"
        - name: workers
          value: "1"
        - name: TEST_TMPDIR
          value: "/mnt"
        image: registry.cn-beijing.aliyuncs.com/tool-sys/tf-train-demo:rw
        imagePullPolicy: Always
        name: tensorflow
        ports:
        - containerPort: 20000
          name: tfjob-port
          protocol: TCP
        volumeMounts:
          - name: train
            mountPath: "/mnt/train/data"
          - name: logging
            mountPath: "/mnt/training_logs"
        workingDir: /root
      priority: 0
      restartPolicy: Never
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: train
        persistentVolumeClaim:
          claimName: tf-train-pvc
      - name: logging
        persistentVolumeClaim:
          claimName: tf-logging-pvc
    EOF
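    While the pod is still running, you can confirm that the split behaves as intended: the dataset mount rejects writes and the checkpoint mount accepts them. This is a sketch; the exact error text varies by ossfs version.

    # Writing to the read-only dataset mount should fail with a read-only file system error
    kubectl exec tf-mnist -- sh -c 'touch /mnt/train/data/probe'

    # Writing to the checkpoint mount should succeed (the probe object is removed immediately)
    kubectl exec tf-mnist -- sh -c 'touch /mnt/training_logs/probe && rm /mnt/training_logs/probe'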

Option 2: Write using the OSS SDK

Use the OSS SDK to write checkpoints directly to OSS without a read/write PV. The training pod uses a read-only PV for the dataset and calls the SDK for checkpoint uploads.

  1. Add the OSS Python SDK to the container image:

    RUN pip install oss2

    For installation options, see Installation.
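    To confirm the SDK is available in the built image, you can run a quick import check (<your-image> is a placeholder for the image you built):

    docker run --rm <your-image> python -c "import oss2; print(oss2.__version__)"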

  2. Modify the training code to upload checkpoints using the SDK. The original code saves checkpoints to log_dir every 100 iterations using tf.train.Saver with max_to_keep=0 (retaining all checkpoints). After 1,000 iterations, this produces 10 sets of checkpoint files in OSS.

    • Set max_to_keep=1 to retain only the latest checkpoint, reducing memory usage.

    • Upload each checkpoint file to OSS using put_object_from_file after saving.

    In workloads with separated read and write paths, use asynchronous I/O with the SDK to further improve training throughput.
    def train():
        ...
        saver = tf.train.Saver(max_to_keep=0)
    
        for i in range(FLAGS.max_steps):
            if i % 10 == 0:  # Record summaries and test-set accuracy
                summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False))
                print('Accuracy at step %s: %s' % (i, acc))
                if i % 100 == 0:
                    print('Save checkpoint at step %s: %s' % (i, acc))
                    saver.save(sess, FLAGS.log_dir + '/model.ckpt', global_step=i)

    Replace it with SDK-based uploads, applying the two changes above to reduce memory usage and remove the need for a read/write PV. Read the AccessKey and bucket settings from environment variables; for credential configuration, see Configure access credentials.

    import os

    import oss2
    from oss2.credentials import EnvironmentVariableCredentialsProvider

    # Credentials are read from the OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET
    # environment variables set in the pod spec.
    auth = oss2.ProviderAuth(EnvironmentVariableCredentialsProvider())
    url = os.getenv('URL', '<default-url>')
    bucketname = os.getenv('BUCKET', '<default-bucket-name>')
    bucket = oss2.Bucket(auth, url, bucketname)

    ...
    def train():
        ...
        saver = tf.train.Saver(max_to_keep=1)

        for i in range(FLAGS.max_steps):
            if i % 10 == 0:  # Record summaries and test-set accuracy
                summary, acc = sess.run([merged, accuracy], feed_dict=feed_dict(False))
                print('Accuracy at step %s: %s' % (i, acc))
                if i % 100 == 0:
                    print('Save checkpoint at step %s: %s' % (i, acc))
                    saver.save(sess, FLAGS.log_dir + '/model.ckpt', global_step=i)
                    # FLAGS.log_dir = os.path.join(os.getenv('TEST_TMPDIR', '/mnt'), 'training_logs')
                    # Upload every file in the checkpoint directory to the
                    # tf-train/training_logs prefix in the bucket.
                    for path, _, file_list in os.walk(FLAGS.log_dir):
                        for file_name in file_list:
                            bucket.put_object_from_file(
                                os.path.join('tf-train/training_logs', file_name),
                                os.path.join(path, file_name))

    The modified container image is registry.cn-beijing.aliyuncs.com/tool-sys/tf-train-demo:ro.

  3. Deploy the pod with the read-only PV and SDK credentials. The pod configuration sets accessModes to ReadOnlyMany and adds OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET environment variables so the SDK can authenticate using the same credentials as the PV. The following example shows the full YAML:

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
      namespace: default
    stringData:
      akId: "<your-accesskey-id>"
      akSecret: "<your-accesskey-secret>"
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: tf-train-pv
      labels:
        alicloud-pvname: tf-train-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: tf-train-pv
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: "<your-bucket-name>"
          url: "oss-<region>.aliyuncs.com"
          otherOpts: "-o kernel_cache -o max_stat_cache_size=10000 -o umask=022 -o allow_other"
          path: "/tf-train/train/data"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: tf-train-pvc
    spec:
      accessModes:
      - ReadOnlyMany
      resources:
        requests:
          storage: 10Gi
      selector:
        matchLabels:
          alicloud-pvname: tf-train-pv
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        app: tfjob
      name: tf-mnist
      namespace: default
    spec:
      containers:
      - command:
        - sh
        - -c
        - python /app/main.py
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: void
        - name: gpus
          value: "0"
        - name: workers
          value: "1"
        - name: TEST_TMPDIR
          value: "/mnt"
        - name: OSS_ACCESS_KEY_ID      #The source of the AccessKey is the same as that of the PV.
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: akId
        - name: OSS_ACCESS_KEY_SECRET  #The source of the AccessKey is the same as that of the PV.
          valueFrom:
            secretKeyRef:
              name: oss-secret
              key: akSecret
        - name: URL                    #You can ignore this if a default URL is configured.
          value: "https://oss-<region>.aliyuncs.com"
        - name: BUCKET                 #You can ignore this if a default BUCKET is configured.
          value: "<bucket-name>"
        image: registry.cn-beijing.aliyuncs.com/tool-sys/tf-train-demo:ro
        imagePullPolicy: Always
        name: tensorflow
        ports:
        - containerPort: 20000
          name: tfjob-port
          protocol: TCP
        volumeMounts:
          - name: train
            mountPath: "/mnt/train/data"
        workingDir: /root
      priority: 0
      restartPolicy: Never
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: train
        persistentVolumeClaim:
          claimName: tf-train-pvc
    EOF

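    While the pod is still running, you can confirm that the SDK credentials were injected from the Secret (run this before the job completes):

    kubectl exec tf-mnist -- printenv | grep '^OSS_'
    # Expect OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET to be listed.
    # Note: this prints the key values in plain text.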

Verify read/write splitting

After deploying with either write option:

  1. Check the pod status:

    kubectl get pod tf-mnist

    Wait until the status changes from Running to Completed:

    NAME       READY   STATUS      RESTARTS   AGE
    tf-mnist   0/1     Completed   0          2m25s
  2. Check the data load time:

    kubectl logs tf-mnist | grep dataload

    With read/write splitting and kernel cache enabled, the data load time drops compared to the baseline without splitting:

    dataload cost time:  0.843528985977

    The baseline time without splitting is approximately 1.54 seconds. The improvement is more significant for large-scale training jobs or workloads that load data repeatedly.

  3. Log in to the OSS Management Console and verify that checkpoint files are present under /tf-train/training_logs in the bucket.

References

OSS SDK reference

All examples in this topic use the Python SDK. SDKs are also available for PHP, Node.js, Browser.js, .NET, Android, iOS, and Ruby; see SDK Reference.

Other write tools

These tools also support writing objects to OSS and can be used for the write path in read/write splitting scenarios:

| Tool | Reference |
| --- | --- |
| OSS Management Console | Quick start |
| OpenAPI | PutObject |
| ossutil command line interface | cp (upload files) |
| ossbrowser graphical management tool | Common operations |