
Elastic Compute Service: NVMe protocol

Last Updated: Mar 21, 2024

Non-Volatile Memory Express (NVMe) is a logical device interface specification and a transport protocol that serves a role similar to that of the Advanced Host Controller Interface (AHCI) and uses the PCI Express (PCIe) bus to access non-volatile memory. This topic describes the terms related to NVMe and the use scenarios of NVMe.

Terms

NVMe

Description: NVMe is a logical device interface specification and a transport protocol that defines a command set and a feature set for PCIe SSDs to boost performance and efficiency and to provide interoperability across a wide range of enterprise-class and client systems.

Benefit: NVMe is designed for SSDs and offers an efficient interface that speeds up communication between SSDs and CPUs. It delivers faster response times and higher bandwidth than traditional driver protocols such as Small Computer System Interface (SCSI) and virtio-blk. NVMe is evolving into an industry standard for data center servers and clients. Alibaba Cloud provides NVMe-capable enhanced SSDs (ESSDs), also called NVMe disks. NVMe disks deliver high performance and provide enterprise-class features. Each NVMe disk can be attached to multiple NVMe-capable Elastic Compute Service (ECS) instances for data sharing.
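
As a quick illustration of how an NVMe disk surfaces inside a guest operating system, the following Python sketch enumerates NVMe controllers through the standard Linux sysfs layout. The sysfs paths are standard Linux; the assumption that the serial attribute of an Alibaba Cloud NVMe disk embeds the cloud disk ID is ours and is not stated in this topic.

```python
import glob
import os

def read_attr(ctrl_path: str, attr: str) -> str:
    """Read one sysfs attribute of an NVMe controller."""
    with open(os.path.join(ctrl_path, attr)) as f:
        return f.read().strip()

# Enumerate NVMe controllers through the standard Linux sysfs layout.
for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
    name = os.path.basename(ctrl)       # for example, nvme0
    model = read_attr(ctrl, "model")    # device model string
    serial = read_attr(ctrl, "serial")  # assumed to embed the cloud disk ID
    print(f"{name}: model={model!r} serial={serial!r}")
```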

Multi-attach

Description: Multi-attach is a feature that allows you to attach a single NVMe disk to multiple ECS instances that reside in the same zone. The instances can then concurrently read from and write to the NVMe disk.

Benefit: Multi-attach allows data to be shared among instances without the need to move the data, which helps reduce storage costs, increase fault tolerance, and improve business scalability. Multi-attach is suitable for scenarios such as high-availability databases, distributed database clusters each of which consists of one write node and multiple read-only nodes, distributed caches, and machine learning acceleration.
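
To sketch how multi-attach looks in practice, the following Python example calls the ECS AttachDisk operation once per instance through the legacy aliyun-python-sdk-ecs package. The disk and instance IDs are hypothetical placeholders, and the exact SDK surface may differ from the version you use; this is a sketch under those assumptions, not a definitive walkthrough.

```python
from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526.AttachDiskRequest import AttachDiskRequest

# Hypothetical IDs for illustration only.
DISK_ID = "d-xxxxxxxxxxxxxxxxxxxx"
INSTANCE_IDS = ["i-instance1xxxxxxxxxx", "i-instance2xxxxxxxxxx"]

client = AcsClient("<access_key_id>", "<access_key_secret>", "cn-hangzhou")

# Call AttachDisk once per instance. For a multi-attach enabled NVMe disk,
# repeated AttachDisk calls against different instances in the same zone
# are assumed to succeed instead of failing with "disk already attached".
for instance_id in INSTANCE_IDS:
    request = AttachDiskRequest()
    request.set_DiskId(DISK_ID)
    request.set_InstanceId(instance_id)
    response = client.do_action_with_exception(request)
    print(response)
```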

Persistent Reservation (PR)

Description: The NVMe protocol provides PR commands to control client access to disks. The PR Register, Acquire, Release, and Report commands are available and can be used to register, acquire, and remove reservations and to query reservation status. You can run these commands to configure different permissions on disks for clients to improve data reliability and security. For more information, see the NVM Express Base Specification.

Benefit: In multi-attach scenarios, data may be corrupted when multiple clients write to the same disk concurrently. PR provides precise control over the read and write permissions on disks to ensure that data is written from compute nodes as expected. For example, when a failover occurs, PR can prevent data from being written to the failed node and ensure data correctness on the node to which your business is failed over.
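
As a concrete illustration of these commands, the following Python sketch drives the four PR operations through the reservation subcommands of the Linux nvme-cli tool. The device path and reservation key are hypothetical, and the sketch assumes the namespace belongs to a shared NVMe disk that supports reservations.

```python
import subprocess

DEVICE = "/dev/nvme1n1"  # hypothetical namespace of the shared NVMe disk
KEY = "0xAB01"           # reservation key chosen by this host

def nvme(*args: str) -> None:
    """Run an nvme-cli command and echo it for visibility."""
    cmd = ["nvme", *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Register this host's reservation key on the namespace (PR Register).
nvme("resv-register", DEVICE, f"--nrkey={KEY}", "--rrega=0")

# Acquire a Write Exclusive reservation (PR Acquire; rtype=1 means
# Write Exclusive, racqa=0 means a plain acquire).
nvme("resv-acquire", DEVICE, f"--crkey={KEY}", "--rtype=1", "--racqa=0")

# Query the current reservation status (PR Report).
nvme("resv-report", DEVICE)

# Release the reservation when done (PR Release).
nvme("resv-release", DEVICE, f"--crkey={KEY}", "--rtype=1", "--rrela=0")
```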

Shared NVMe disk

Description: A shared NVMe disk is a disk that supports the multi-attach and PR features based on NVMe. Each shared NVMe disk can be attached to up to 16 ECS instances at the same time.

Benefit: Shared NVMe disks are suitable for high-availability databases and distributed database clusters each of which consists of one write node and multiple read-only nodes. Shared NVMe disks are also a go-to storage option when you migrate traditional SAN-based high-availability workloads, such as Oracle RAC, SAP HANA, and cloud-native databases, to the cloud.

Cluster file system

Description: When the multi-attach feature is used, a cluster file system can be mounted on multiple nodes at the same time. All of the nodes on which a cluster file system is mounted have access to data in the file system. New data, new files, and changes to file system metadata are synchronized to these nodes in real time so that data remains consistent at the file system level across the nodes.

Benefit: Traditional file systems such as ext3 and ext4 cache data and metadata to accelerate access. As a result, new data, new files, and disk space information on each node are cached locally and cannot be perceived by other nodes in real time. Cluster file systems are introduced to resolve this issue. Common cluster file systems include Oracle Cluster File System version 2 (OCFS2) and Oracle Database Filesystem (DBFS).
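
As a minimal sketch of putting a cluster file system on a shared NVMe disk, the following Python example formats the disk with OCFS2 and mounts it. It assumes that the ocfs2-tools package is installed and that the o2cb cluster stack is already configured on every node (cluster configuration is out of scope here); the device path and mount point are hypothetical.

```python
import subprocess

DEVICE = "/dev/nvme1n1"      # hypothetical shared NVMe disk namespace
MOUNT_POINT = "/mnt/shared"  # hypothetical mount point used on every node

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Create an OCFS2 file system with 16 node slots, matching the maximum
# number of instances a shared NVMe disk can be attached to.
run(["mkfs.ocfs2", "-N", "16", DEVICE])

# Mount the file system; the same mount command is then run on each node.
run(["mkdir", "-p", MOUNT_POINT])
run(["mount", "-t", "ocfs2", DEVICE, MOUNT_POINT])
```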

Use scenarios

The NVMe protocol is used for NVMe disks and shared NVMe disks.

NVMe disks

NVMe is evolving into an industry standard, and an ever-growing number of applications are built on NVMe SSDs. ESSDs that support NVMe are called NVMe disks. NVMe disks provide the same read and write interfaces as NVMe SSDs and can work seamlessly with traditional NVMe SSD-based applications that are migrated to the cloud. NVMe disks also benefit from the scalability, zero O&M, and high performance of cloud resources, and support cloud services such as snapshots. For more information, see NVMe disks.

Shared NVMe disks

When you create an ESSD, you can enable the multi-attach feature for the ESSD. Disks that support NVMe and have the multi-attach feature enabled are called shared NVMe disks. For more information, see Enable multi-attach.
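
The following Python sketch creates an ESSD with multi-attach enabled through the legacy aliyun-python-sdk-ecs package. The zone, size, and performance level are placeholders, and the availability of the MultiAttach parameter and its setter in your SDK version is an assumption.

```python
from aliyunsdkcore.client import AcsClient
from aliyunsdkecs.request.v20140526.CreateDiskRequest import CreateDiskRequest

client = AcsClient("<access_key_id>", "<access_key_secret>", "cn-hangzhou")

# Create an ESSD with multi-attach enabled, which makes it a shared
# NVMe disk. Zone, size, and performance level are placeholders.
request = CreateDiskRequest()
request.set_ZoneId("cn-hangzhou-h")
request.set_DiskCategory("cloud_essd")
request.set_PerformanceLevel("PL3")
request.set_Size(2048)              # capacity in GiB
request.set_MultiAttach("Enabled")  # assumed to exist in this SDK version
response = client.do_action_with_exception(request)
print(response)
```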

Shared NVMe disks are suitable for high-availability, high-concurrency, and scalable business and help implement cloud migration of traditional SAN-based business. The common use scenarios of shared NVMe disks include data sharing, high-availability failover, distributed cache acceleration, and model training in machine learning.

  • Data sharing

    Data sharing is the simplest use scenario of NVMe. For example, a single container image stored on a shared NVMe disk in the cloud can be read and loaded by multiple instances. After data is written to a shared NVMe disk from one of its attachment nodes, all of the other attachment nodes have access to the data. This reduces storage costs and improves read/write performance.

  • High-availability failover

    Shared NVMe disks can be used to ensure the high availability of traditional SAN-based databases, such as Oracle RAC, SAP HANA, and cloud-native databases. In practice, business is vulnerable to single points of failure (SPOFs). On the storage and network side, shared NVMe disks help ensure business continuity and high availability in case of SPOFs. On the compute side, where nodes are subject to outages, downtime, and hardware failures, you can deploy your business in primary/secondary mode to achieve high availability.

    For example, when a primary database instance fails, business can be immediately failed over to the secondary database instance. Then, you can run an NVMe PR command to revoke the write permissions of the failed database instance so that data is no longer written to it. The failover process is as follows (a sketch of the PR step appears after the steps):

    1. The primary database instance (Database Instance 1) fails and stops providing services.

    2. Run an NVMe PR command to prevent data from being written to Database Instance 1 and allow data to be written to the secondary database instance (Database Instance 2).

    3. Restore Database Instance 2 to the same state as Database Instance 1 by using methods such as log replay.

    4. Database Instance 2 takes over as the primary database instance to provide services.
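
    The PR step of this process can be sketched as follows in Python on top of nvme-cli. The device path and reservation keys are hypothetical, and the sketch assumes that each instance registered its own reservation key when it attached the disk.

    ```python
    import subprocess

    DEVICE = "/dev/nvme1n1"   # hypothetical shared disk namespace on Instance 2
    NEW_PRIMARY_KEY = "0xB2"  # reservation key registered by Database Instance 2
    FAILED_KEY = "0xA1"       # reservation key held by failed Database Instance 1

    def nvme(*args: str) -> None:
        subprocess.run(["nvme", *args], check=True)

    # Step 2 of the failover: from Database Instance 2, preempt the reservation
    # held by the failed Database Instance 1 (racqa=1 means "preempt"). This
    # revokes Instance 1's write access so that no stale writes can land on the
    # disk, and grants Instance 2 a Write Exclusive reservation (rtype=1).
    nvme("resv-acquire", DEVICE,
         f"--crkey={NEW_PRIMARY_KEY}",
         f"--prkey={FAILED_KEY}",
         "--rtype=1", "--racqa=1")

    # Steps 3 and 4 (log replay and promotion) are database-specific and are
    # handled by the database's own tooling, not shown here.
    ```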

  • Distributed cache acceleration

    Shared NVMe disks deliver high performance in terms of IOPS and throughput and can be used to accelerate slow- or medium-speed storage systems. For example, data lakes are typically built on top of Object Storage Service (OSS), and each data lake can be accessed by multiple clients at the same time. OSS delivers high sequential read throughput and high append write throughput, but has high latency and poor random read/write performance. Shared NVMe disks can serve as a cache layer between the compute and storage sides to significantly improve access performance for data lakes.

  • Machine learning

    In machine learning scenarios, after a sample is labeled and written, the sample is split and distributed across multiple nodes for distributed training of neural network models. When GPUs are used as compute resources for high-performance machine learning, slow storage may become a bottleneck. In this case, you can use shared NVMe disks to accelerate the entire model training process.
