By Wen Gu from Alibaba Cloud, core member of network SIG in the OpenAnolis community.
Jingxuan Li from Alibaba Cloud, owner of network SIG in the OpenAnolis community.
Yang Lu from Alibaba Cloud, core member of network SIG in the OpenAnolis community.
During Apsara Conference 2021, Alibaba Cloud released the fourth-generation X-Dragon architecture, whose elastic RDMA acceleration capability deserves special attention. Building on elastic RDMA, Alibaba Cloud Linux 3 and Anolis OS (the OpenAnolis community's operating system) provide a socket-compatible RDMA solution by optimizing the community implementation of Shared Memory Communications over RDMA (SMC-R). The goal is to let cloud applications enjoy the performance improvement brought by RDMA without modification. This article describes the background, principle, architecture, and some performance data of SMC-R.
In recent years, with the rapid development of cloud computing, virtualized networks on the cloud have made a qualitative leap in performance, especially with the emergence of hardware virtualization solutions such as Alibaba Cloud X-Dragon and AWS Nitro. Compared with the thriving cloud network, CPU performance has improved only slowly. Offloading part of the work originally done by the CPU to a Data Processing Unit (DPU) has therefore become a research focus in cloud computing. Remote Direct Memory Access (RDMA) is a technology that pursues extreme performance. It was once used mainly in specific scenarios, such as high-performance computing and high-frequency trading, but it has now entered the data centers of cloud vendors.
RDMA lets one host access the memory of a remote host directly, bypassing the kernel. It is widely used in data- and compute-intensive scenarios and suits a wide array of fields, such as high-performance computing, machine learning, data centers, and mass storage. RDMA has been running stably at large scale in Alibaba for many years, supporting core businesses such as Alibaba Cloud ESSD and PolarDB. Its reliability has been fully verified in major scenarios, such as Double 11.
RDMA features zero-copy and protocol stack offload. It offloads the protocol stack to RDMA Network Interface Controllers (RNICs) and performs direct memory access without involving the kernel. By bypassing the kernel protocol stack, RDMA reduces the CPU resources required for protocol processing and data copying, and delivers lower latency and higher throughput than traditional TCP/IP implementations.
In the past, RDMA was available only with the NICs and switches of some data centers and was complex to deploy. Now, with Alibaba Cloud elastic RDMA, the once complex RDMA technology is easy to obtain: ordinary ECS users can use high-performance RDMA transmission without worrying about the underlying configuration of NICs, switches, and the rest of the physical network. This makes RDMA an accessible technology that benefits everyone.
The performance of RDMA is great, but RDMA uses the IB verbs interface, which is more complex than, and very different from, the commonly used POSIX socket interface. Ordinary applications need substantial changes to use RDMA, so there is a high technical threshold for applying RDMA efficiently to existing businesses.
Therefore, there have been some attempts to encapsulate the IB verbs semantics of RDMA into socket interfaces, such as rsocket and libvma. Among them, libvma intercepts the socket interface through LD_PRELOAD and uses user-mode verbs instead to complete the data transmission. However, these approaches have defects: because the conversion happens in user mode, they lack the kernel's unified resource management, and they also have some compatibility issues.
From the perspective of resource management and compatibility, implementing socket interfaces in the kernel has natural advantages over user mode. In Alibaba Cloud Linux 3 and Anolis OS, we provided and optimized Shared Memory Communications over RDMA (SMC-R). This is an attempt to achieve TCP application compatibility based on kernel RDMA.
The native SMC-R supports the standard RoCE network. We have extended it to support iWARP for the first time. As a result, it fully supports Alibaba Cloud's in-house elastic RDMA, so cloud applications can enjoy the performance improvement brought by RDMA with zero modification.
In summary, SMC-R is a protocol family that is compatible with the TCP socket interface but uses RDMA for the actual data transmission.
SMC-R works in the kernel, between the socket layer and the IB verbs layer of the kernel RDMA stack. SMC-R is an excellent translator and manager: it receives socket calls from users and uses RDMA's IB verbs interfaces to manage RDMA resources and complete the underlying RDMA-based data transmission. Therefore, the user only needs to change the protocol family of the original socket from AF_INET or AF_INET6 to AF_SMC to move from the TCP protocol stack to the SMC-R protocol stack.
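The family switch described above can be sketched as follows. This is a minimal illustration, not the Alibaba Cloud implementation: it assumes the upstream Linux values `AF_SMC = 43`, `SMCPROTO_SMC = 0` (IPv4), and `SMCPROTO_SMC6 = 1` (IPv6), which Python's `socket` module does not export, and the helper names `to_smc_args` and `smc_available` are hypothetical. In upstream Linux, as far as we know, the IPv6 case is selected through the protocol argument rather than through a separate address family.

```python
import socket

# Assumed values from the upstream Linux headers; Python's socket module
# does not define them.
AF_SMC = 43
SMCPROTO_SMC = 0   # SMC with IPv4 addressing
SMCPROTO_SMC6 = 1  # SMC with IPv6 addressing


def to_smc_args(family: int, sock_type: int) -> tuple:
    """Map the (family, type) of a TCP socket to socket() arguments for SMC."""
    if sock_type != socket.SOCK_STREAM:
        raise ValueError("SMC only replaces stream (TCP) sockets")
    if family == socket.AF_INET:
        return (AF_SMC, sock_type, SMCPROTO_SMC)
    if family == socket.AF_INET6:
        return (AF_SMC, sock_type, SMCPROTO_SMC6)
    raise ValueError("only AF_INET and AF_INET6 can be mapped to AF_SMC")


def smc_available() -> bool:
    """Probe whether this kernel can create an AF_SMC socket."""
    try:
        args = to_smc_args(socket.AF_INET, socket.SOCK_STREAM)
        socket.socket(*args).close()
        return True
    except OSError:
        # No SMC support in this kernel: the caller keeps using TCP.
        return False
```

On a host without SMC support, `smc_available()` simply returns `False` and the application keeps its AF_INET socket, so the probe is safe to run anywhere.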
However, this is not enough. We want to complete the protocol replacement without any modification to the application. To this end, Alibaba Cloud Linux 3 and Anolis OS added protocol family replacement functions and whitelists at the socket layer. They also provide transparent TCP-to-SMC-R replacement at the granularity of a net namespace or a single application. This enables applications to transmit data on the RDMA highway without any modification.
Using SMC-R on the local node alone is not enough either. The remote node also needs SMC-R capability for the highway to come into play. SMC-R therefore provides automatic negotiation and a safe fallback-to-TCP mechanism.
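The replacement decision can be pictured with a toy model. The article does not describe how the whitelist is keyed or how the switches are named, so everything below is an assumption made for illustration: the `ReplacePolicy` class, its fields, and matching by application name are hypothetical and do not reflect the actual kernel interface.

```python
import socket
from dataclasses import dataclass, field

AF_SMC = 43  # assumed Linux value; not exported by Python's socket module


@dataclass
class ReplacePolicy:
    """Hypothetical policy: replace per net namespace or per application."""
    netns_enabled: bool = False                      # whole-namespace switch
    app_whitelist: set = field(default_factory=set)  # individual applications

    def choose_family(self, app: str, family: int, sock_type: int) -> int:
        # Only TCP-style stream sockets are candidates for replacement.
        is_tcp = (sock_type == socket.SOCK_STREAM
                  and family in (socket.AF_INET, socket.AF_INET6))
        if not is_tcp:
            return family
        if self.netns_enabled or app in self.app_whitelist:
            return AF_SMC
        return family
```

In this model the application still asks for AF_INET; the policy decides behind its back whether the socket becomes AF_SMC, which is what makes the replacement transparent.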
To establish an RDMA connection, SMC-R first establishes a TCP connection to the remote node in the kernel. During the handshake, the local node uses specific TCP options to indicate that it supports SMC-R and checks whether the remote node supports it as well. After the negotiation succeeds, SMC-R allocates the necessary RDMA resources on behalf of the user-mode application and registers the buffer that receives data as a Remote Memory Buffer (RMB), so that the remote node can access it directly. SMC-R then encapsulates the corresponding access key and the buffer start address into a remote access token (RToken) and sends it to the remote node, which needs this information to be authorized to access the RMB.
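To make the RToken concrete, the sketch below packs an access key and a buffer start address into a fixed-size token and unpacks it again on the peer. The layout (a 4-byte key followed by an 8-byte address in network byte order) is an assumption for illustration, and `RToken`, `pack`, and `unpack` are hypothetical names, not kernel structures.

```python
import struct
from dataclasses import dataclass

# Assumed layout: 32-bit access key, then 64-bit start address,
# network byte order, no padding (12 bytes in total).
_RTOKEN_FORMAT = "!IQ"


@dataclass(frozen=True)
class RToken:
    rkey: int        # access key for the registered RMB
    start_addr: int  # start address of the RMB

    def pack(self) -> bytes:
        """Serialize the token so it can be sent to the remote node."""
        return struct.pack(_RTOKEN_FORMAT, self.rkey, self.start_addr)

    @classmethod
    def unpack(cls, raw: bytes) -> "RToken":
        """Rebuild the token on the node that will write into the RMB."""
        rkey, start_addr = struct.unpack(_RTOKEN_FORMAT, raw)
        return cls(rkey, start_addr)
```

The writing peer needs both fields: the address tells it where the RMB starts, and the key proves to the RNIC that it is allowed to write there.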
In special cases, if either the sending end or the receiving end cannot meet the conditions for RDMA transfer during negotiation, the safe fallback mechanism is triggered. SMC-R then uses the TCP connection established during negotiation for all subsequent data transmission, which ensures the reliability and stability of the network.
After the data transmission highway is established, traffic rules need to be formulated properly. As its name implies, SMC-R implements shared memory communication over RDMA. It uses a ring RMB as the shared memory and relies on data cursors to transmit data efficiently.
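The outcome of the negotiation can be summarized as a small decision function. This is a simplified model under stated assumptions: the real handshake exchanges more information than two booleans per side, and the names `Peer` and `choose_transport` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    supports_smcr: bool  # capability advertised during the TCP handshake
    rdma_ready: bool     # a usable RDMA device and resources are available


def choose_transport(local: Peer, remote: Peer) -> str:
    """Return "smc-r" only if both sides can use RDMA, otherwise "tcp"."""
    both_support = local.supports_smcr and remote.supports_smcr
    both_ready = local.rdma_ready and remote.rdma_ready
    if both_support and both_ready:
        return "smc-r"
    # Fallback: keep using the TCP connection that carried the negotiation.
    return "tcp"
```

Because the TCP connection already exists when the decision is made, falling back costs no extra connection setup and the application never sees a failure.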
SMC-R performs RDMA Write operations to write the data that applications pass to the kernel into the ring RMB of the remote node. At the same time, it uses RDMA Send and Receive operations to exchange Connection Data Control (CDC) messages, which update and synchronize the cursors of the RMBs. For the RMB on one side, the reading peer updates its consumer cursor to indicate the address of the next byte to be consumed, and the writing peer never writes beyond the consumer cursor, so unread data is not overwritten. Similarly, the writing peer updates its producer cursor to indicate the address of the next byte to be written, and the reading peer never reads beyond the producer cursor, so it only sees valid data. The two cursors chase each other around the ring RMB for the whole duration of the transfer, which keeps the transmission safe and reliable.
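The cursor rules can be illustrated with a small in-memory ring buffer. This is a single-process sketch, not the kernel data structure: it assumes the cursors are plain monotonic byte counters, whereas the real protocol synchronizes them between two hosts through CDC messages, and the class name `RingRMB` is hypothetical.

```python
class RingRMB:
    """Toy ring buffer with a producer cursor and a consumer cursor."""

    def __init__(self, size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.buf = bytearray(size)
        self.producer = 0  # total bytes written so far (writing peer)
        self.consumer = 0  # total bytes consumed so far (reading peer)

    def free_space(self) -> int:
        # Bytes between the cursors are unread and must not be overwritten.
        return self.size - (self.producer - self.consumer)

    def write(self, data: bytes) -> int:
        """Write as much as fits without passing the consumer cursor."""
        n = min(len(data), self.free_space())
        for i in range(n):
            self.buf[(self.producer + i) % self.size] = data[i]
        self.producer += n
        return n

    def read(self, max_bytes: int) -> bytes:
        """Read up to max_bytes without passing the producer cursor."""
        n = min(max_bytes, self.producer - self.consumer)
        out = bytes(self.buf[(self.consumer + i) % self.size] for i in range(n))
        self.consumer += n
        return out
```

A writer that finds `free_space()` equal to zero has to wait until the reader advances the consumer cursor, which is the back-pressure the two cursors provide.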
Errors are inevitable when data speeds along the highway. To keep everything under control, Alibaba Cloud Linux 3 and Anolis OS provide a series of monitoring and diagnosis interfaces and tools for SMC-R. These include sysctl settings for controlling transparent replacement, proc files for querying SMC-R socket status, and the smc-tools toolset for obtaining information on each dimension of SMC-R. With these interfaces and tools, the network can be monitored and is easy to operate and maintain.
The architecture overview and analysis above show that SMC-R is compatible with the socket interface, can transparently replace TCP, and uses RDMA for the underlying data transmission. These are SMC-R's core benefits.
Based on the introduction above, we know that one of the main advantages of SMC-R is to provide better performance while ensuring compatibility, but how is the performance? Let's take a look at how SMC-R works with Alibaba Cloud elastic RDMA in several typical application scenarios.
In latency-sensitive data query and processing scenarios, SMC-R with elastic RDMA increases Redis QPS by 50% on average across different data sizes and by 20% on average across different numbers of clients.
In high-performance RPC scenarios, SMC-R with elastic RDMA increases Thrift QPS by about 30% on average and Netty QPS by about 12% on average.
The experimental data above shows that SMC-R performs well in latency-sensitive data query and high-performance RPC scenarios. Beyond these experimental scenarios, SMC-R can also be applied to large-scale data exchange within clusters, high-throughput transfer of large files, and other scenarios.
Of course, SMC-R is not all-purpose, and it inherits some drawbacks of RDMA. In particular, creating an RDMA connection is far slower than creating a TCP connection because it requires interaction with the hardware. In scenarios with a large number of short-lived connections, SMC-R therefore performs worse than TCP. SMC-R also needs to allocate memory in advance for each connection, so when the number of connections is very large, its memory usage is higher than that of TCP. In addition, SMC-R is a communication method for use within the data center and is not suitable for exposure to the public network.
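A back-of-the-envelope estimate shows how preallocated buffers add up. The buffer sizes below are hypothetical inputs chosen for illustration, not measured SMC-R defaults, and `preallocated_bytes` is a hypothetical helper that assumes every connection holds its own send and receive buffer.

```python
def preallocated_bytes(connections: int, send_buf: int, recv_buf: int) -> int:
    """Memory reserved up front if every connection gets its own buffers."""
    if connections < 0 or send_buf < 0 or recv_buf < 0:
        raise ValueError("arguments must be non-negative")
    return connections * (send_buf + recv_buf)


if __name__ == "__main__":
    KIB = 1024
    # Hypothetical workload: 10,000 connections, 64 KiB per direction.
    total = preallocated_bytes(10_000, 64 * KIB, 64 * KIB)
    print(f"{total / 1024**3:.2f} GiB reserved before any data is sent")
```

Under these assumed sizes the reservation is about 1.22 GiB, and it exists whether or not the connections carry traffic, which is why very large connection counts favor TCP.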
In the future, we hope to continue optimizing SMC-R to benefit cloud applications. To this end, we will open-source the code in the high-performance network SIG of the OpenAnolis community and continue improving it. At the same time, we will publish SMC-R-related documents on the Alibaba Cloud Linux 3 official website. We believe that in the near future, SMC-R can help more users move their applications to RDMA non-intrusively and enjoy the network performance improvement it brings.