Alibaba Cloud NVMe disk and shared storage

How is 24/7 high availability achieved?

In the real world, single points of failure are the norm, and keeping the business running through failures is the core capability of a highly available system. How do critical applications in finance, insurance, and government achieve 24/7 availability? Broadly speaking, a business system consists of compute, network, and storage. On the cloud, network multipathing and distributed storage already provide stable high availability; to make the full path highly available, however, single points of failure on the compute and application side must also be eliminated. Take a typical database as an example: users can hardly accept a single point of failure bringing the business to a halt. When an instance becomes unavailable due to power loss, a crash, or a hardware fault, how can the business be restored quickly?

The solution differs by scenario. MySQL usually builds a primary/standby (master-slave) architecture: when the primary fails, the system switches to the standby, which continues to serve traffic. But how is data consistency between primary and standby guaranteed after the switch? Depending on how much data loss the business can tolerate, MySQL replicates data synchronously or asynchronously, which introduces new problems: data loss in some scenarios, performance degradation from synchronous replication, scaling out that requires a whole extra set of machines and a full data copy, and long switchover times that hurt business continuity. Clearly, building a highly available system this way makes the architecture complex, and it is hard to balance availability, reliability, scalability, cost, and performance. Is there a more advanced solution that lets us have it both ways? The answer is: yes!

With shared storage, different database instances can share the same data, so high availability is obtained by quickly switching compute instances (Figure 1). Oracle RAC, AWS Aurora, and Alibaba Cloud PolarDB are representative examples. The key here is the shared storage itself. Traditional SANs are expensive, hard to scale up or down, and the controller head easily becomes a bottleneck; the high barrier to entry is unfriendly to users. Is there a better, faster, and cheaper form of shared storage that solves these pain points? Alibaba Cloud's recently launched NVMe cloud disk with sharing support fully meets this need, and the rest of this article focuses on it. Here is a question to keep in mind: after an instance switch, if the original database is still writing data, how is data correctness guaranteed? Readers may want to think it over before reading on.

Wheel of history: SAN on cloud and NVMe

We have entered the digital economy era in which "data is the new oil". The rapid development of cloud computing, artificial intelligence, the Internet of Things, 5G, and other technologies has driven explosive data growth. According to IDC's 2020 report, the global data volume grows year by year and will reach 175 ZB by 2025, concentrated mainly in public clouds and enterprise data centers. This rapid growth provides new momentum and new requirements for storage. Let's recall how block storage evolved, step by step.

DAS: The storage device is directly attached to the host (over SCSI, SAS, FC, and similar protocols). The system is simple, easy to configure and manage, and low in cost. But because storage resources cannot be pooled or shared, centralized management and maintenance are difficult.

SAN: Storage arrays and business hosts are connected through a dedicated network, which solves unified management and data sharing and enables high-performance, low-latency data access. However, SAN storage is expensive, complex to operate and maintain, and poor in scalability, raising the bar for users.

All-flash: The revolution in the underlying storage media and falling costs marked the arrival of the all-flash era. Since then, the performance bottleneck has shifted to the software stack, forcing large-scale software changes and driving the rapid development of user-space protocols, software-hardware co-design, RDMA, and other technologies, bringing a leap in storage performance.

Cloud disk: As cloud computing took off, storage moved to the cloud. Cloud disks have inherent advantages: elasticity, flexibility, ease of use, easy scaling, high reliability, large capacity, low cost, and freedom from operations and maintenance. They have become a solid storage foundation for digital transformation.

SAN on cloud: To support storage workloads in all respects and replace traditional SAN storage, SAN on cloud emerged in response to the times. It inherits the many advantages of cloud disks while also offering traditional SAN capabilities, including shared storage, data protection, synchronous/asynchronous replication, and instant snapshots, and it will continue to shine in the enterprise storage market.

On the other hand, NVMe is becoming the darling of the new era in the evolution of storage protocols.

SCSI/SATA: In the early days, hard disks were mostly slow devices. Data traveled through the SCSI layer and the SATA bus, and performance was bounded by the slow media (such as mechanical hard disks), which masked the disadvantages of SATA's single channel and the SCSI software stack.

VirtIO-BLK/VirtIO-SCSI: With the rapid development of virtualization and cloud computing, VirtIO-BLK/VirtIO-SCSI gradually became the mainstream storage protocols of cloud computing, making storage resources more flexible, agile, secure, and scalable.

NVMe/NVMe-oF: The development and popularization of flash memory set off a new storage technology revolution. Once the medium was no longer the performance barrier, the software stack became the biggest bottleneck, spawning high-performance lightweight protocols such as NVMe/NVMe-oF, DPDK/SPDK, and user-space networking. The NVMe protocol family offers high performance, advanced features, and strong scalability, and is set to lead the new era of cloud computing.

In the foreseeable future, SAN on cloud plus NVMe is clearly where storage is headed.

The new era of cloud disks: NVMe

With the rapid development and popularization of flash memory, the performance bottleneck has moved to the software side, and growing demands on storage performance and features have pushed NVMe onto the stage. NVMe defines a data access protocol purpose-built for high-performance devices. Compared with the traditional SCSI protocol, NVMe is simpler and lighter, and its multi-queue design greatly improves storage performance. NVMe also provides a wealth of storage features. Since its birth in 2011, the NVMe standard has, through continuous refinement, standardized many advanced functions such as multiple namespaces, multipath, full-link data protection (T10 DIF), the Persistent Reservation permission-control protocol, and atomic writes. These features will continue to help users create value.

NVMe's high performance and rich feature set provide a solid foundation for enterprise storage, and the protocol's own scalability and growth have become the core driving force behind the evolution of the NVMe cloud disk. The NVMe cloud disk is built on ESSD. It inherits ESSD's high reliability, high availability, high performance, and atomic writes, as well as enterprise features such as native snapshot data protection, cross-region disaster recovery, encryption, and second-level performance scaling. The combination of ESSD and NVMe features meets the needs of enterprise applications, allowing most NVMe- and SCSI-based workloads to move to the cloud seamlessly. The shared storage technology described in this article is implemented on top of the NVMe Persistent Reservation standard. As one of the NVMe cloud disk's additional capabilities, its multiple mounting and IO Fencing can significantly reduce storage costs while improving business flexibility and data reliability. It is widely used in distributed scenarios, especially highly available database systems such as Oracle RAC and SAP HANA.

The cutting edge of enterprise storage: shared storage

As mentioned earlier, shared storage effectively solves the database high-availability problem. Its main capabilities are multiple mounting and IO Fencing. Taking databases as an example, we describe how they work.

The key to business high availability -- multiple mounting

Multiple mounting allows one cloud disk to be attached to multiple ECS instances at the same time (currently up to 16), all of which can read and write the disk (Figure 6). Because multiple nodes share the same data, storage costs drop, and when a single node fails the business can quickly switch to a healthy node. The switch requires no data replication, providing the basic capability for fast failure recovery. Highly available databases such as Oracle RAC and SAP HANA rely on this feature. Note that shared storage provides consistency and recovery at the data layer; to reach final business-level consistency, the application may need additional processing, such as database log replay.
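The failover idea above can be sketched in a few lines. This is a conceptual toy model, not the real Alibaba Cloud API: the `SharedDisk` class and its names are invented for illustration. The point is that because both nodes are already attached to the same disk, switching the active role copies no data.

```python
# Toy model of multi-attach failover (hypothetical names, not a real API).

class SharedDisk:
    """A cloud disk attached to several compute nodes at once."""
    MAX_ATTACH = 16  # the current multi-attach limit mentioned above

    def __init__(self):
        self.blocks = {}       # the shared data lives in exactly one place
        self.attached = set()

    def attach(self, node_id):
        if len(self.attached) >= self.MAX_ATTACH:
            raise RuntimeError("attach limit reached")
        self.attached.add(node_id)

def failover(disk, standby):
    # The standby simply becomes the active node; no blocks are copied,
    # because it was already attached to the same disk.
    assert standby in disk.attached
    return standby

disk = SharedDisk()
disk.attach("primary")
disk.attach("standby")
disk.blocks[0] = b"committed data"

active = failover(disk, "standby")
print(active, disk.blocks[0])   # the standby sees the data instantly
```

Contrast this with a replication-based design, where the standby would first have to catch up on any unreplicated writes before serving traffic.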

In general, a single-node file system is not suitable for multiple mounting. To speed up file access, ext4 and similar file systems cache data and metadata, so file modifications cannot be propagated to other nodes in time, leaving the nodes with inconsistent views. Unsynchronized metadata also causes nodes to conflict over disk space allocation, corrupting data. Multiple mounting therefore usually needs to be paired with a cluster file system such as OCFS2, GFS2, GPFS, Veritas CFS, or Oracle ACFS; Alibaba Cloud DBFS and PolarFS also provide this capability.
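The metadata conflict can be demonstrated with a deliberately simplified model (this is not real ext4 behavior, just the shape of the problem): each node keeps a private cached copy of the allocation bitmap, so two nodes can "allocate" the same block.

```python
# Toy model: why an uncoordinated single-node file system breaks
# under multiple mounts.

class Disk:
    def __init__(self, nblocks):
        self.bitmap = {i: False for i in range(nblocks)}  # True = in use

class Node:
    def __init__(self, disk):
        self.disk = disk
        # Bitmap cached at mount time and never refreshed from the disk.
        self.cache = dict(disk.bitmap)

    def allocate(self):
        # Pick the first block that looks free *in the stale local cache*.
        free = min(b for b, used in self.cache.items() if not used)
        self.cache[free] = True
        self.disk.bitmap[free] = True   # write-back with no coordination
        return free

disk = Disk(4)
node_a, node_b = Node(disk), Node(disk)
block_a = node_a.allocate()
block_b = node_b.allocate()
print(block_a, block_b, block_a == block_b)  # both grabbed block 0
```

A cluster file system avoids this by coordinating allocations through a distributed lock manager, so every node sees the same bitmap before it allocates.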

With multiple mounting in place, can you rest easy? Not quite: multiple mounting has a blind spot it cannot solve on its own, namely permission management. Applications built on multiple mounting usually rely on a cluster manager such as Linux Pacemaker to manage permissions. In some scenarios, however, permission management fails, with serious consequences. Recall the question raised at the beginning of the article: in a highly available architecture, the primary instance fails over to the standby after an exception. If the primary is merely hung (due to a network partition, hardware fault, or similar), it still believes it holds write permission and may write dirty data alongside the standby. How do we avoid this risk? This is where IO Fencing comes in.

Data correctness assurance -- IO Fencing

One way to prevent dirty writes is to abort the original instance's in-flight requests, reject any newly issued ones, and switch instances only after ensuring stale data can no longer land on disk. The traditional solution along these lines is STONITH ("shoot the other node in the head"): remotely restart the failed machine so its stale data never reaches the disk. This scheme has two problems. First, the restart takes too long, so switching is slow, typically stopping the business for tens of seconds to minutes. More seriously, because the IO path on the cloud is long and involves many components, a component failure (hardware, network, and so on) on the compute instance can leave IO unrecoverable for a while, so data correctness cannot be guaranteed 100%.

To solve the problem at its root, NVMe standardized the Persistent Reservation (PR) capability, which defines permission configuration rules for NVMe cloud disks and allows the permissions of a disk and its mounted nodes to be changed flexibly. In our scenario, when the master database fails, the standby first sends a PR command that revokes the master's write permission and rejects all of its in-flight requests; the standby can then update data without risk (Figure 7). IO Fencing typically helps applications complete failover at the millisecond level, greatly shortening recovery time. The migration is smooth enough that upper-layer applications barely notice it, a qualitative leap over STONITH. Next, we introduce the permission management technology behind IO Fencing.
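The fencing sequence can be sketched as follows. This is a simplified toy model, not the real PR protocol (real PR uses registration keys and the reservation types defined in the NVMe specification); it only shows how a late write from the fenced master is rejected at the disk.

```python
# Toy model of PR-based IO Fencing during failover.

class NvmeDisk:
    def __init__(self):
        self.holder = None   # the node currently allowed to write
        self.data = []

    def pr_acquire(self, node, preempt=False):
        # Preempt: the standby takes the reservation away from a hung master.
        if self.holder is None or preempt:
            self.holder = node

    def write(self, node, payload):
        if node != self.holder:
            raise PermissionError(f"{node}: reservation conflict")
        self.data.append(payload)

disk = NvmeDisk()
disk.pr_acquire("master")
disk.write("master", "row-1")

# The master hangs; the standby fences it *before* taking over writes.
disk.pr_acquire("standby", preempt=True)
disk.write("standby", "row-2")

try:
    disk.write("master", "stale-row")   # a late in-flight write is rejected
except PermissionError as err:
    print(err)
```

The crucial ordering is that the preempt happens before the standby's first write, so even a master that wakes up later can never corrupt the shared data.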

Swiss Army Knife for Permission Management -- Persistent Reservation

The NVMe Persistent Reservation (PR) protocol defines permissions for the cloud disk and its clients. Combined with multiple mounting, it enables efficient, safe, and stable service switching. In the PR protocol, a mounted node has one of three roles: Holder, Registrant, or Non-Registrant. As the names suggest, the holder has full permissions on the cloud disk, a registrant has partial permissions, and a non-registrant generally has read-only access. The cloud disk in turn supports six reservation modes, covering exclusive access, one writer with many readers, and many writers. By configuring the reservation mode and node roles, the permissions of each node can be managed flexibly (Table 1) to match a wide range of business scenarios. NVMe PR inherits all the capabilities of SCSI PR, so applications based on SCSI PR can run on NVMe shared cloud disks with few changes.
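A permission table of this shape can be encoded directly. The sketch below follows the six reservation types in the NVMe specification, simplified to a (read, write) pair per role; the mode names are informal shorthand, and the table here stands in for Table 1 rather than reproducing it exactly.

```python
# Simplified NVMe PR permission matrix: mode -> role -> (can_read, can_write).

PERMS = {
    "write-exclusive":              {"holder": (1, 1), "registrant": (1, 0), "non-registrant": (1, 0)},
    "exclusive-access":             {"holder": (1, 1), "registrant": (0, 0), "non-registrant": (0, 0)},
    "write-exclusive-registrants":  {"holder": (1, 1), "registrant": (1, 1), "non-registrant": (1, 0)},
    "exclusive-access-registrants": {"holder": (1, 1), "registrant": (1, 1), "non-registrant": (0, 0)},
    "write-exclusive-all-reg":      {"holder": (1, 1), "registrant": (1, 1), "non-registrant": (1, 0)},
    "exclusive-access-all-reg":     {"holder": (1, 1), "registrant": (1, 1), "non-registrant": (0, 0)},
}

def can_write(mode, role):
    return bool(PERMS[mode][role][1])

def can_read(mode, role):
    return bool(PERMS[mode][role][0])

# Write-exclusive mode: only the holder writes, but everyone can read,
# which is exactly the one-writer/many-readers pattern.
print(can_write("write-exclusive", "holder"))      # True
print(can_write("write-exclusive", "registrant"))  # False
print(can_read("write-exclusive", "non-registrant"))  # True
```

Picking "write-exclusive" gives the one-writer/many-readers setup mentioned below, while the "registrants" variants let a whole cluster of registered nodes write, as Oracle RAC-style deployments require.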

Table 1: NVMe Persistent Reservation Permission Table

Multiple mounting combined with IO Fencing is a solid foundation for a highly available system. In addition, NVMe shared disks provide one-writer/many-readers access, widely used in read/write-splitting databases, machine learning model training, stream processing, and other scenarios. Shared cloud disks also make it easy to implement image sharing, heartbeat detection, arbitration, lock mechanisms, and similar building blocks.

Unveiling NVMe Cloud Disk Technology

The NVMe cloud disk is implemented on a compute-storage separation architecture. It achieves efficient NVMe virtualization and an extremely fast IO path on the DPCA hardware platform, uses Pangu 2.0 storage as its base for high reliability, availability, and performance, and interconnects compute and storage through user-space network protocols and RDMA. The NVMe cloud disk is the crystallization of a full stack of high-performance, high-availability technologies (Figure 9).

NVMe hardware virtualization: Built on the DPCA MOC platform, NVMe hardware virtualization exchanges data flow and control flow efficiently through the Submission Queue (SQ) and Completion Queue (CQ). The simple NVMe protocol, an efficient design, and hardware offloading together reduce NVMe virtualization latency by 30%.
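The SQ/CQ interaction can be modeled minimally: the driver posts commands to the submission queue, and the device consumes them and posts results, tagged with the command id, to the completion queue. This is a behavioral sketch only; real NVMe queues are ring buffers in shared memory with doorbell registers, not Python deques.

```python
# Minimal behavioral model of an NVMe queue pair.
from collections import deque

sq, cq = deque(), deque()   # submission queue, completion queue

def submit(cid, opcode):
    # The driver enqueues a command; cid lets it match the completion later.
    sq.append({"cid": cid, "opcode": opcode})

def device_poll():
    # The (virtual) device drains the SQ and completes each command.
    while sq:
        cmd = sq.popleft()
        cq.append({"cid": cmd["cid"], "status": "success"})

submit(1, "write")
submit(2, "read")
device_poll()
print([c["cid"] for c in cq])   # completions matched to commands: [1, 2]
```

Because many independent SQ/CQ pairs can exist, one per CPU core, submissions need no cross-core locking, which is the source of NVMe's multi-queue performance advantage over single-queue SCSI paths.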

Extremely fast IO channel: Based on DPCA MOC software-hardware integration, an extremely fast IO channel is realized, effectively shortening the IO path and delivering extreme performance.

User-space protocol: NVMe uses the new-generation Solar-RDMA user-space network communication protocol, combined with the self-developed Leap CC congestion control, to achieve reliable data transmission and reduce long-tail network latency. Jumbo frames on the 25G network enable efficient large-packet transmission, and the full separation of the data plane and control plane simplifies the network software stack and improves performance. Network multipath technology supports millisecond-level recovery from link failures.

Highly available control plane: Built on Pangu 2.0 distributed high-availability storage, the NVMe control center lets NVMe control commands bypass the control node, achieving reliability and availability close to that of the IO path and helping users switch businesses at the millisecond level. On top of the control center, precise flow control between multiple clients and servers delivers sub-second distributed IO throttling. Multi-node IO Fencing consistency is implemented on the distributed system: permissions across cloud disk partitions are kept consistent through two-phase updates, effectively solving the partition permission problem.

Large-IO atomicity: On top of the distributed system, end-to-end atomic writes for large IOs are implemented across compute, network, and storage. As long as an IO does not cross an adjacent 128K boundary, the same data will never be partially written. This matters for databases and other applications that depend on atomic writes: it can optimize away the database double-write process and thus greatly improve write performance.
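Given the 128K rule above, whether a particular IO is guaranteed atomic reduces to a boundary check: the first and last byte must fall in the same 128 KiB-aligned window. A quick sketch (sizes in bytes):

```python
# Check whether an IO stays within one 128 KiB-aligned window,
# per the atomicity guarantee described above.

BOUNDARY = 128 * 1024  # 128 KiB

def is_atomic(offset, length):
    first_window = offset // BOUNDARY
    last_window = (offset + length - 1) // BOUNDARY
    return first_window == last_window

print(is_atomic(0, 16 * 1024))           # True: entirely inside the first 128K
print(is_atomic(120 * 1024, 16 * 1024))  # False: crosses the 128K boundary
```

This is why a database page write (for example a 16K InnoDB page at an aligned offset) can skip the double-write buffer: an aligned page never straddles a 128K boundary, so it is either fully written or not written at all.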

Current status and future prospects

As shown above, today's NVMe cloud disk combines the industry's most advanced software and hardware technologies and is the first in the cloud storage market to deliver the NVMe protocol, shared access, and IO Fencing at the same time. On top of ESSD it achieves high reliability, availability, and performance, while offering rich NVMe-based enterprise features such as multiple mounting, IO Fencing, encryption, offline resizing, native snapshots, and asynchronous replication.

At present, the NVMe cloud disk and NVMe shared disk are in invitational preview and have received preliminary certification from Oracle RAC, SAP HANA, and internal database teams. Next, the public preview will be expanded and the product commercialized. In the foreseeable future, we will continue to evolve the NVMe cloud disk to support online resizing, full-link data protection (T10 DIF), multiple namespaces per cloud disk, and other advanced features, growing into a comprehensive cloud SAN. Stay tuned!
