
Container Service for Kubernetes:Using NVMe cloud disk multi-attach and Reservation for data sharing between applications

Last Updated: Mar 26, 2026

Use multi-attach to share a single ESSD, ESSD AutoPL, or other NVMe-capable cloud disk across up to 16 ECS instances in the same zone simultaneously — or share a zone-redundant storage ESSD across nodes in the same region. Combined with NVMe Persistent Reservation (PR), multi-attach gives your workloads shared storage access with precise write-permission control, enabling efficient data sharing and fast failover in an ACK cluster.

Use cases

  • Data sharing: After one node writes data to a shared NVMe disk, all other attached nodes can read it immediately. A single container image stored on an NVMe disk can be loaded by multiple instances running the same OS, reducing storage costs and improving read/write performance.


  • High-availability failover: Traditional clustered databases — including Oracle Real Application Clusters (RAC), SAP High-performance ANalytic Appliance (HANA), and cloud-native high-availability (HA) databases — are vulnerable to single points of failure (SPOFs). A shared NVMe disk keeps storage accessible when a compute node fails. Deploy your workload in primary/secondary mode: when the primary instance fails, run an NVMe PR command to revoke its write permissions, then promote the secondary instance. This prevents split-brain writes and ensures data consistency. The failover sequence:

    1. The primary database instance (Database Instance 1) fails and stops serving traffic.

    2. Run an NVMe PR command to block writes to Database Instance 1 and grant write access to Database Instance 2.

    3. Restore Database Instance 2 to the same state as Database Instance 1 (for example, by replaying logs).

    4. Database Instance 2 takes over as the primary instance.

    PR is part of the NVMe specification. It controls read and write permissions at the disk level to ensure compute nodes write data as expected. For details, see NVM Express Base Specification.
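    For illustration, the preemption in step 2 can be issued with nvme-cli. This is a hedged sketch: the device path and the two reservation keys are placeholders, and the flags match those used in the application example later in this topic:

        # Run on the node of Database Instance 2 (the new primary).
        # 0xBBBB is Instance 2's key; 0xAAAA is the key Instance 1 registered with.
        nvme resv-register /dev/nvme1n1 --iekey --nrkey=0xBBBB
        # racqa=1 preempts the holder of prkey; rtype=1 is write exclusive.
        nvme resv-acquire /dev/nvme1n1 --racqa=1 --rtype=1 --crkey=0xBBBB --prkey=0xAAAA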


  • Distributed data cache acceleration: Data lakes built on Object Storage Service (OSS) offer high append-write throughput but suffer from high latency and low random read/write performance. Attach a high-speed, multi-attach-enabled cloud disk as a shared cache layer across compute nodes to significantly improve access performance.


  • Machine learning: After sample data is labeled and written, distribute it across nodes for parallel training without copying data over the network. Each compute node reads directly from the shared disk, reducing transfer latency and accelerating large-scale model training.


Billing

The multi-attach feature does not incur additional fees. Resources that support the NVMe protocol are billed based on their original billing methods. For cloud disk pricing, see Elastic Block Storage volumes.

Limitations

  • A single NVMe cloud disk can be attached to a maximum of 16 ECS instances in the same zone at the same time.

  • To read and write to a cloud disk from multiple nodes concurrently, mount the cloud disk using volumeDevices. This mounts the disk as a block device and does not support file system access. Use volumeMode: Block and accessModes: ReadWriteMany in your PersistentVolumeClaim (PVC).

  • For the full list of limits, see Limits of the multi-attach feature.

Prerequisites

Before you begin, ensure that you have:

  • An ACK managed cluster running Kubernetes 1.20 or later. To create one, see Create an ACK managed cluster.

  • csi-plugin and csi-provisioner at version v1.24.10-7ae4421-aliyun or later. To upgrade, see Manage the csi-plugin and csi-provisioner components.

  • At least two nodes in the same zone that support the multi-attach feature. For supported instance families, see Limits of the multi-attach feature.

  • A containerized application that meets both of the following requirements:

    • Supports concurrent access to the same cloud disk from multiple replicas.

    • Ensures data consistency using NVMe Reservation or an equivalent mechanism.


Application example

The following example application demonstrates lease-based leader election over a shared NVMe block device. Multiple replicas compete for a lease written directly to the disk. Only one replica holds the lease at a time — if it stops refreshing, another replica preempts it using NVMe Reservation commands.

Key design notes:

  • O_DIRECT is used to open the block device, bypassing the page cache and ensuring reads reflect what was actually written to disk.

  • The example uses the Linux kernel's simplified Reservation interface (<linux/pr.h> ioctls). Alternatives that require elevated privileges:

    • C: ioctl(fd, NVME_IOCTL_IO_CMD, &cmd); (see the sketch after this list)

    • CLI: nvme-cli

  • For the full NVMe Reservation specification, see NVMe Specification.
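As a reference for the raw passthrough alternative, the following is a minimal sketch of Reservation Acquire through NVME_IOCTL_IO_CMD. The helper name and error handling are illustrative; the opcode and cdw10 layout follow the NVMe Base Specification:

#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <sys/ioctl.h>

// Issue Reservation Acquire (opcode 0x11) on an NVMe namespace block device.
// racqa: 0 = acquire, 1 = preempt; rtype: 1 = write exclusive.
// Requires CAP_SYS_ADMIN; error handling is omitted for brevity.
static int nvme_resv_acquire_raw(int fd, uint64_t crkey, uint64_t prkey,
                                 uint8_t racqa, uint8_t rtype) {
    uint64_t keys[2] = { crkey, prkey };        // payload: CRKEY, then PRKEY
    struct nvme_passthru_cmd cmd = {
        .opcode   = 0x11,                       // Reservation Acquire
        .nsid     = (uint32_t)ioctl(fd, NVME_IOCTL_ID), // namespace ID of this device
        .addr     = (uint64_t)(uintptr_t)keys,
        .data_len = sizeof(keys),
        .cdw10    = (racqa & 0x7) | ((uint32_t)rtype << 8),
    };
    return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}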

The source code of the application example follows, in C and Bash versions.

C

#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/pr.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

const char *disk_device = "/dev/data-disk";
uint64_t magic = 0x4745D0C5CD9A2FA4;

void panic(const char *restrict format, ...) {
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    exit(EXIT_FAILURE);
}

struct lease {
    uint64_t magic;
    struct timespec acquire_time;
    char holder[64];
};

volatile bool shutdown = false;
void on_term(int signum) {
    shutdown = true;
}

struct lease *lease;
const size_t lease_alloc_size = 512;

void acquire_lease(int disk_fd) {
    int ret;

    struct pr_registration pr_reg = {
        .new_key = magic,
        .flags = PR_FL_IGNORE_KEY,
    };
    ret = ioctl(disk_fd, IOC_PR_REGISTER, &pr_reg);
    if (ret != 0)
        panic("failed to register (%d): %s\n", ret, strerror(errno));

    struct pr_preempt pr_pre = {
        .old_key = magic,
        .new_key = magic,
        .type  = PR_WRITE_EXCLUSIVE,
    };
    ret = ioctl(disk_fd, IOC_PR_PREEMPT, &pr_pre);
    if (ret != 0)
        panic("failed to preempt (%d): %s\n", ret, strerror(errno));

    // register again in case we preempted ourselves
    ret = ioctl(disk_fd, IOC_PR_REGISTER, &pr_reg);
    if (ret != 0)
        panic("failed to register (%d): %s\n", ret, strerror(errno));
    fprintf(stderr, "Register as key %lx\n", magic);


    struct pr_reservation pr_rev = {
        .key   = magic,
        .type  = PR_WRITE_EXCLUSIVE,
    };
    ret = ioctl(disk_fd, IOC_PR_RESERVE, &pr_rev);
    if (ret != 0)
        panic("failed to reserve (%d): %s\n", ret, strerror(errno));

    lease->magic = magic;
    gethostname(lease->holder, sizeof(lease->holder));

    while (!shutdown) {
        clock_gettime(CLOCK_MONOTONIC, &lease->acquire_time);
        ret = pwrite(disk_fd, lease, lease_alloc_size, 0);
        if (ret < 0)
            panic("failed to write lease: %s\n", strerror(errno));
        fprintf(stderr, "Refreshed lease\n");
        sleep(5);
    }
}

int timespec_compare(const struct timespec *a, const struct timespec *b) {
    if (a->tv_sec < b->tv_sec)
        return -1;
    if (a->tv_sec > b->tv_sec)
        return 1;
    if (a->tv_nsec < b->tv_nsec)
        return -1;
    if (a->tv_nsec > b->tv_nsec)
        return 1;
    return 0;
}

int main() {
    assert(lease_alloc_size >= sizeof(struct lease));
    lease = aligned_alloc(512, lease_alloc_size);
    if (lease == NULL)
        panic("failed to allocate memory\n");

    // char *reg_key_str = getenv("REG_KEY");
    // if (reg_key_str == NULL)
    //     panic("REG_KEY env not specified");

    // uint64_t reg_key = atoll(reg_key_str) | (magic << 32);
    // fprintf(stderr, "Will register as key %lx", reg_key);


    int disk_fd = open(disk_device, O_RDWR|O_DIRECT);
    if (disk_fd < 0)
        panic("failed to open disk: %s\n", strerror(errno));

    // setup signal handler
    struct sigaction sa = {
        .sa_handler = on_term,
    };
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);

    // Zero-initialize so the first comparison below is well-defined.
    struct timespec last_active_local = {0};
    struct timespec last_active_remote = {0};

    int ret = pread(disk_fd, lease, lease_alloc_size, 0);
    if (ret < 0)
        panic("failed to read lease: %s\n", strerror(errno));

    if (lease->magic != magic) {
        // new disk, no lease
        acquire_lease(disk_fd);
    } else {
        // someone else has the lease
        while (!shutdown) {
            struct timespec now;
            clock_gettime(CLOCK_MONOTONIC, &now);
            if (timespec_compare(&lease->acquire_time, &last_active_remote)) {
                fprintf(stderr, "Remote %s refreshed lease\n", lease->holder);
                last_active_remote = lease->acquire_time;
                last_active_local = now;
            } else if (now.tv_sec - last_active_local.tv_sec > 20) {
                // remote is dead
                fprintf(stderr, "Remote is dead, preempting\n");
                acquire_lease(disk_fd);
                break;
            }
            sleep(5);
            int ret = pread(disk_fd, lease, lease_alloc_size, 0);
            if (ret < 0)
                panic("failed to read lease: %s\n", strerror(errno));
        }
    }

    close(disk_fd);
}

Bash

#!/bin/bash

set -e

DISK_DEVICE="/dev/data-disk"
MAGIC=0x4745D0C5CD9A2FA4

SHUTDOWN=0
trap "SHUTDOWN=1" SIGINT SIGTERM

function acquire_lease() {
    # racqa:
    # 0: acquire
    # 1: preempt

    # rtype:
    # 1: write exclusive

    nvme resv-register $DISK_DEVICE --iekey --nrkey=$MAGIC
    nvme resv-acquire $DISK_DEVICE --racqa=1 --rtype=1 --prkey=$MAGIC --crkey=$MAGIC
    # register again in case we preempted ourselves
    nvme resv-register $DISK_DEVICE --iekey --nrkey=$MAGIC
    nvme resv-acquire $DISK_DEVICE --racqa=0 --rtype=1 --prkey=$MAGIC --crkey=$MAGIC

    while [[ $SHUTDOWN -eq 0 ]]; do
        echo "$MAGIC $(date +%s) $HOSTNAME" | dd of=$DISK_DEVICE bs=512 count=1 oflag=direct status=none
        echo "Refreshed lease"
        sleep 5
    done
}

LEASE=$(dd if=$DISK_DEVICE bs=512 count=1 iflag=direct status=none)

if [[ $LEASE != $MAGIC* ]]; then
    # new disk, no lease
    acquire_lease
else
    last_active_remote=-1
    last_active_local=-1
    while [[ $SHUTDOWN -eq 0 ]]; do
        now=$(date +%s)
        read -r magic timestamp holder <<< "$LEASE"
        if [ "$last_active_remote" != "$timestamp" ]; then
            echo "Remote $holder refreshed the lease"
            last_active_remote=$timestamp
            last_active_local=$now
        elif (($now - $last_active_local > 10)); then
            echo "Remote is dead, preempting"
            acquire_lease
            break
        fi
        sleep 5
        LEASE=$(dd if=$DISK_DEVICE bs=512 count=1 iflag=direct status=none)
    done
fi

The YAML in the steps below applies to the C version. The Bash version uses nvme-cli, which issues NVMe passthrough commands that require the SYS_ADMIN capability, so add the following to your container's securityContext:

securityContext:
  capabilities:
    add: ["SYS_ADMIN"]

The Dockerfiles for the two versions follow.

Dockerfile for the C version:

# syntax=docker/dockerfile:1.4

FROM buildpack-deps:bookworm AS builder

COPY lease.c /usr/src/nvme-resv/
RUN gcc -o /lease -O2 -Wall /usr/src/nvme-resv/lease.c

FROM debian:bookworm-slim

COPY --from=builder --link /lease /usr/local/bin/lease
ENTRYPOINT ["/usr/local/bin/lease"]

Dockerfile for the Bash version:

# syntax=docker/dockerfile:1.4
FROM debian:bookworm-slim

RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
    --mount=type=cache,target=/var/lib/apt,sharing=locked \
    sed -i 's/deb.debian.org/mirrors.aliyun.com/g' /etc/apt/sources.list.d/debian.sources && \
    rm -f /etc/apt/apt.conf.d/docker-clean && \
    apt-get update && \
    apt-get install -y nvme-cli

COPY --link lease.sh /usr/local/bin/lease
ENTRYPOINT ["/usr/local/bin/lease"]

Step 1: Deploy the application and configure multi-attach

Create a StorageClass that enables multi-attach, a PVC configured as a block device, and a StatefulSet that uses the lease application image.

  1. Create a file named lease.yaml with the following content. Replace the container image address with your actual image address.

    Important
    • NVMe Reservation takes effect at the node level. If multiple pods run on the same node, they can interfere with each other. This example uses podAntiAffinity to prevent that.

    • If your cluster has nodes that do not use the NVMe protocol, configure node affinity to restrict scheduling to NVMe-capable nodes; a sketch follows this note.
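
    For example, the following nodeAffinity snippet restricts scheduling by instance type. The instance type shown is a placeholder; use types from the supported instance families:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
              - ecs.g7se.2xlarge   # placeholder: replace with an NVMe-capable instance type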

    The complete lease.yaml file:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: alicloud-disk-shared
    parameters:
      type: cloud_essd # Currently supports cloud_essd, cloud_auto, and cloud_regional_disk_auto
      multiAttach: "true"
    provisioner: diskplugin.csi.alibabacloud.com
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: data-disk
    spec:
      accessModes: [ "ReadWriteMany" ]
      storageClassName: alicloud-disk-shared
      volumeMode: Block
      resources:
        requests:
          storage: 20Gi
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: lease-test
    spec:
      replicas: 2
      serviceName: lease-test
      selector:
        matchLabels:
          app: lease-test
      template:
        metadata:
          labels:
            app: lease-test
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - lease-test
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: lease
            image: <IMAGE OF APP>   # Replace with the image address of your application.
            volumeDevices:
            - name: data-disk
              devicePath: /dev/data-disk
          volumes:
          - name: data-disk
            persistentVolumeClaim:
              claimName: data-disk

    The following table summarizes the key differences between multi-attach and standard mounting configurations:

    | Resource        | Field                  | Multi-attach                               | Standard mounting                |
    | --------------- | ---------------------- | ------------------------------------------ | -------------------------------- |
    | StorageClass    | parameters.multiAttach | "true"                                     | Not required                     |
    | PVC             | accessModes            | ReadWriteMany                              | ReadWriteOnce                    |
    | PVC             | volumeMode             | Block                                      | Filesystem                       |
    | Volume mounting | Method                 | volumeDevices (direct block device access) | volumeMounts (file system mount) |
  2. Deploy the application:

    kubectl apply -f lease.yaml
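
    After the pods start, you can confirm that the two replicas are scheduled to different nodes, as enforced by the pod anti-affinity above:

    kubectl get pods -l app=lease-test -o wide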

Step 2: Verify multi-attach and Reservation

Verify that multiple nodes can read and write to the same disk

Run the following command to view pod logs:

kubectl logs -l app=lease-test --prefix -f

Expected output:

[pod/lease-test-0/lease] Register as key 4745d0c5cd9a2fa4
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
[pod/lease-test-0/lease] Refreshed lease
[pod/lease-test-1/lease] Remote lease-test-0 refreshed lease

lease-test-1 immediately reads the data written by lease-test-0, confirming multi-attach is working.

Verify that NVMe Reservation is active

  1. Get the cloud disk ID:

    kubectl get pvc data-disk -ojsonpath='{.spec.volumeName}'
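
    The returned volume name is the cloud disk ID, in the form d-2zxxxxxxxxxxx. As a convenience, the device path used in the next step can be derived in one line (a sketch based on the by-id naming convention shown below):

    DISK_ID=$(kubectl get pvc data-disk -ojsonpath='{.spec.volumeName}')
    echo "/dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_${DISK_ID#d-}"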
  2. Log in to either of the two nodes and run the following command. Replace 2zxxxxxxxxxxx with the portion after d- in the disk ID from the previous step.

    nvme resv-report -c 1 /dev/disk/by-id/nvme-Alibaba_Cloud_Elastic_Block_Storage_2zxxxxxxxxxxx

    Expected output:

    NVME Reservation status:
    
    gen       : 3
    rtype     : 1
    regctl    : 1
    ptpls     : 1
    regctlext[0] :
      cntlid     : ffff
      rcsts      : 1
      rkey       : 4745d0c5cd9a2fa4
      hostid     : 4297c540000daf4a4*****

    rtype: 1 (write exclusive) and regctl: 1 confirm that NVMe Reservation is active.

Verify that Reservation blocks writes from a failed node

  1. Log in to the node running lease-test-0 and pause the process to simulate a failure:

    pkill -STOP -f /usr/local/bin/lease
  2. Wait 30 seconds, then check the logs:

    kubectl logs -l app=lease-test --prefix -f

    Expected output:

    [pod/lease-test-1/lease] Remote lease-test-0 refreshed lease
    [pod/lease-test-1/lease] Remote is dead, preempting
    [pod/lease-test-1/lease] Register as key 4745d0c5cd9a2fa4
    [pod/lease-test-1/lease] Refreshed lease
    [pod/lease-test-1/lease] Refreshed lease
    [pod/lease-test-1/lease] Refreshed lease

    lease-test-1 has preempted the lease and taken over as the primary.

  3. Resume the paused process on the lease-test-0 node:

    pkill -CONT -f /usr/local/bin/lease
  4. Check the logs again:

    kubectl logs -l app=lease-test --prefix -f

    Expected output:

    [pod/lease-test-0/lease] failed to write lease: Invalid exchange

    lease-test-0 can no longer write to the disk. The Invalid exchange error confirms that Reservation blocked the write I/O from the former primary node; the lease process then exits, and Kubernetes restarts its container automatically.

What's next

If your NVMe cloud disk runs out of space, see Expand a cloud disk volume.