Alibaba Cloud Linux 3 provides Shared Memory Communication (SMC), a high-performance network protocol that functions in kernel space. SMC utilizes shared memory technology and works with socket interfaces to establish network communications. SMC is classified into the following types based on shared memory technology: Shared Memory Communications - Direct Memory Access (SMC-D) and Shared Memory Communications over Remote Direct Memory Access (SMC-R). SMC-D uses internal shared memory (ISM) technology, and SMC-R uses remote direct memory access (RDMA) technology. This topic describes SMC-R and how to use it.
Background information
IBM contributed SMC-R to the upstream Linux kernel in version 4.11 in 2017 and continues to maintain it. For more information about SMC-R, see RFC 7609. Alibaba Cloud Linux 3 leverages Alibaba Cloud Elastic RDMA (eRDMA) to unlock the use of SMC-R in the cloud. SMC-R can transparently replace TCP in applications without loss of functionality and deliver high-performance, hardware-software co-designed networks that are accessible to all users.
The shared memory-based data exchange model of SMC-R relies on the atomic memory operations provided by RDMA. RDMA implements protocol stacks in RDMA network interface cards (RNICs) to allow network nodes to bypass the kernel and directly access remote memory. Compared with traditional TCP networks, RDMA networks reduce memory-to-memory copies and consume fewer CPU resources in data transfers to provide low-latency, high-throughput communications. The following figure shows the differences between TCP/IP stacks and RDMA stacks.
RDMA is widely used in data-intensive and compute-intensive scenarios and is suitable for multiple fields, such as high-performance computing, machine learning, data centers, and massive storage.
In the past, RDMA was used only in some data centers together with network cards and switches and was complex to deploy. Alibaba Cloud eRDMA brings RDMA to the cloud. This allows all Elastic Compute Service (ECS) users to use RDMA to transmit data without the need to make complex configurations for the underlying physical network environment, such as network cards and switches.
RDMA relies on InfiniBand (IB) verbs interfaces, which are significantly different from traditional POSIX socket interfaces. Existing socket applications require substantial rework before they can be migrated to RDMA networks, and applying RDMA requires a high level of technical expertise.
To utilize eRDMA and deliver higher network performance, Alibaba Cloud Linux 3 provides optimized SMC-R. Optimized SMC-R utilizes RDMA in an efficient manner and is compatible with standard TCP applications. This helps improve the performance of more applications without modifications.
Benefits
SMC-R has the following benefits:
High performance
RDMA offloads protocol stacks from the kernel to network cards. This gives SMC-R lower network latency, higher throughput, and lower CPU utilization than traditional TCP stacks in specific scenarios.
Hardware offloading
Reliable and efficient direct access to remote memory
Transparent replacement
SMC-R is compatible with POSIX socket interfaces and provides transparent replacement features to allow socket applications to switch from TCP stacks to SMC-R stacks without modifications or further development.
SMC-R can call socket interfaces to provide shared memory communications.
SMC-R enables multi-level transparent replacement of protocol stacks without functionality loss.
SMC-R provides automatic negotiation and secure fallback mechanisms.
Architecture
SMC-R architecture:
Protocol hierarchy and transparent replacement
SMC-R functions in kernel space and supports the network behaviors that user-mode programs describe by using socket interfaces. SMC-R also performs RDMA transmission by using IB verbs interfaces. SMC-R stacks are responsible for using, managing, and maintaining RDMA resources. Applications are not affected by the RDMA entities in the kernel. The following figure shows the architecture of SMC-R.
Alibaba Cloud Linux 3 provides a mechanism that can be used to transparently replace TCP stacks with SMC-R stacks at the process level or net namespace level. The mechanism uses LD_PRELOAD or sysctl net.smc.tcp2smc to transparently replace AF_INET sockets with AF_SMC sockets in applications. This type of replacement enables data transmission over SMC-R stacks and improves network performance based on RDMA without the need to modify applications.
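What the replacement amounts to can be sketched in Python: the application asks for a stream socket, and the socket is created in the SMC family instead of AF_INET. This is a minimal sketch under stated assumptions, not the mechanism that Alibaba Cloud Linux 3 implements. The value AF_SMC = 43 is assumed from the Linux headers (it matches the "protocol family 43" kernel message shown later in this topic), SMCPROTO_SMC = 0 is likewise an assumption, and the helper name open_stream_socket is hypothetical. On a host where the smc module is unavailable, the helper returns an ordinary AF_INET socket.

```python
import socket

AF_SMC = 43        # assumption: value of AF_SMC in the Linux headers
SMCPROTO_SMC = 0   # assumption: SMC protocol number used with IPv4 peers


def open_stream_socket(prefer_smc=True):
    """Return a stream socket, using AF_SMC when the kernel offers it."""
    if prefer_smc:
        try:
            return socket.socket(AF_SMC, socket.SOCK_STREAM, SMCPROTO_SMC)
        except OSError:
            # The smc module is not loaded, or the platform has no AF_SMC.
            pass
    return socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)


if __name__ == "__main__":
    with open_stream_socket() as sock:
        print("socket family:", int(sock.family))
```

With LD_PRELOAD or sysctl net.smc.tcp2smc, the application itself keeps requesting AF_INET and needs no change of this kind; the sketch only makes the substituted socket family visible.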
Automatic negotiation and secure fallback
SMC-R provides automatic negotiation capabilities and can dynamically fall back to TCP. To establish SMC-R communications, the SMC-R stack first establishes a TCP connection to the peer node in the kernel. During the TCP handshake, the local node uses specific TCP options to indicate its support for SMC-R and verifies that the peer node also supports SMC-R.
If the negotiation is successful, the SMC-R stacks on the local and peer nodes create new RDMA resources or reuse existing RDMA resources to establish a usable RDMA RC link. Data is transmitted between the nodes through the RDMA link afterward.
If the negotiation fails, for example, because the local or peer node does not have RDMA devices, the SMC-R stacks automatically fall back to TCP stacks. The local and peer nodes then use the TCP connection that was established during the negotiation to transmit data.
Note: SMC-R supports fallback to TCP stacks only during connection negotiation. SMC-R does not support fallback to TCP stacks during data transmission.
The following figure shows the data flows for connection negotiation and data transmission.
Shared memory communications based on RDMA
After the negotiation is complete and a connection is established, each SMC-R stack locally allocates the SMC-R socket a ring-shaped send buffer (sndbuf) that is used to cache data to be sent and a ring-shaped remote memory buffer (RMB) that is used to cache data to be received.
When an application on the sending node attempts to send data, the application uses socket interfaces to copy the data to the local sndbuf. Then, the SMC-R stack on the sending node performs RDMA Write operations to write the data to the RMB of the receiving node, and performs RDMA Send operations or RDMA Receive operations to send or receive Connection Data Control (CDC) messages to update and synchronize cursors in ring-shaped buffers.
When the SMC-R stack on the receiving node detects that data has been written to the RMB, the stack notifies the application on the receiving node, for example through epoll, to copy the data from the RMB to the user-mode buffer. The data transmission is complete when the data is copied to the user-mode buffer. In SMC-R, RMBs are used as shared memory during data transmission.
The following figure shows the data transmission procedure.
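The buffer and cursor handling described above can be illustrated with a small in-memory simulation. This is a sketch only: no RDMA is involved, the class and function names (Ring, Peer, send, recv) are hypothetical, and the sender reads the receiver's cursors directly, whereas in SMC-R each side learns the peer's cursors from CDC messages.

```python
class Ring:
    """Fixed-size byte ring addressed by ever-increasing cursors."""

    def __init__(self, size):
        self.size = size
        self.buf = bytearray(size)

    def write_at(self, cursor, data):
        """Copy data into the ring starting at cursor, wrapping at the end."""
        for i, byte in enumerate(data):
            self.buf[(cursor + i) % self.size] = byte

    def read_at(self, cursor, length):
        """Return length bytes starting at cursor, wrapping at the end."""
        return bytes(self.buf[(cursor + i) % self.size] for i in range(length))


class Peer:
    """One end of a connection: a local sndbuf and a local RMB."""

    def __init__(self, size):
        self.sndbuf = Ring(size)
        self.rmb = Ring(size)
        self.sent = 0  # bytes staged in the sndbuf so far
        self.prod = 0  # producer cursor: bytes written into this peer's RMB
        self.cons = 0  # consumer cursor: bytes the application has read


def send(sender, receiver, data):
    """Stage data in the sndbuf, copy it to the peer RMB, advance the cursor."""
    free = receiver.rmb.size - (receiver.prod - receiver.cons)
    chunk = data[:min(free, sender.sndbuf.size)]
    sender.sndbuf.write_at(sender.sent, chunk)        # application -> sndbuf
    staged = sender.sndbuf.read_at(sender.sent, len(chunk))
    receiver.rmb.write_at(receiver.prod, staged)      # stands in for RDMA Write
    sender.sent += len(chunk)
    receiver.prod += len(chunk)                       # stands in for a CDC message
    return len(chunk)


def recv(receiver, max_bytes):
    """Copy unread bytes from the RMB to the caller and advance the consumer cursor."""
    n = min(receiver.prod - receiver.cons, max_bytes)
    data = receiver.rmb.read_at(receiver.cons, n)
    receiver.cons += n
    return data
```

Because the cursors only ever grow and are reduced modulo the ring size on access, the free space in the RMB is simply the ring size minus the distance between the producer and consumer cursors. A send that does not fit is truncated to the free space, which corresponds to a sender waiting for the receiver to consume data.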
Use scenarios
SMC-R is suitable for the following scenarios:
Latency-sensitive data queries and processing
SMC-R is suitable for scenarios that involve high-performance data queries and data processing and require high network performance, such as Redis, Memcached, and PostgreSQL. SMC-R can replace TCP in applications in a transparent and non-invasive manner and allows applications to gain a 50% increase in queries per second (QPS) without further development or adaptation.
High-throughput data transmission
A large amount of bandwidth and CPU resources are consumed when data is exchanged or transmitted at a large scale within a cluster. RDMA enables SMC-R to deliver the same throughput at a lower CPU load than traditional TCP stacks. This saves computing resources.
During the handshake process of SMC-R, RDMA resources are requested and created. Therefore, SMC-R is not suitable for short-lived connection scenarios in which connections are frequently established and closed.
The number of connections that SMC-R supports for an ECS instance is subject to the following factors:
Available contiguous physical memory of the instance. By default, the sndbuf and the RMB that are used by each SMC-R socket use the contiguous physical memory that is allocated when an SMC-R connection is established. The default size of the sndbuf is the net.smc.wmem value, and the default size of the RMB is the net.smc.rmem value. You can run the following commands to view the default sizes:
sysctl net.smc.wmem # The default size of the sndbuf used by each SMC-R socket. Unit: bytes.
sysctl net.smc.rmem # The default size of the RMB used by each SMC-R socket. Unit: bytes.
Elastic RDMA Interface (ERI) eRDMA specifications. The maximum number of RDMA resources that SMC-R creates for a connection, such as Queue Pairs (QPs), Memory Registrations (MRs), Completion Queues (CQs), and Protection Domains (PDs), varies based on the ERI eRDMA specifications of the instance.
If SMC-R cannot obtain the required resources, SMC-R securely falls back to TCP to ensure stable and reliable data transmission.
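The memory factor lends itself to a rough estimate: each connection needs one sndbuf and one RMB, so the buffer memory is approximately the connection count multiplied by the sum of net.smc.wmem and net.smc.rmem. The following sketch assumes that the sysctl values are exposed under /proc/sys in the usual way. The helper names are hypothetical, and the estimate ignores buffer reuse and any other per-connection overhead.

```python
from pathlib import Path

DEFAULT_BUF_BYTES = 256 * 1024  # initial net.smc.wmem and net.smc.rmem value given in this topic


def read_sysctl_bytes(name, default=DEFAULT_BUF_BYTES):
    """Read an integer sysctl such as net.smc.wmem from /proc/sys, or return default."""
    path = Path("/proc/sys") / name.replace(".", "/")
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        # The smc module is not loaded or the value is not a plain integer.
        return default


def estimate_buffer_bytes(connections, wmem, rmem):
    """Rough estimate: one sndbuf plus one RMB per SMC-R connection."""
    return connections * (wmem + rmem)


if __name__ == "__main__":
    wmem = read_sysctl_bytes("net.smc.wmem")
    rmem = read_sysctl_bytes("net.smc.rmem")
    total = estimate_buffer_bytes(1000, wmem, rmem)
    print(f"1000 connections need about {total / 2**20:.0f} MiB of buffers")
```

For example, with both values at 256 KB, 1,000 connections need about 500 MiB of contiguous physical memory for buffers alone.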
Instructions
Alibaba Cloud Linux 3 provides optimized SMC-R stacks in the kernel. The SMC-R stacks are backed by comprehensive SMC-R monitoring and diagnostic tools. To use SMC-R, perform the following steps:
Create an ECS instance that supports ERI.
SMC-R relies on RDMA. Before you use SMC-R, you must create an ECS instance that supports ERI to obtain cloud-based RDMA capabilities.
Important: Alibaba Cloud ERI eRDMA devices and SMC do not support IPv6 addresses. If applications use IPv6, SMC falls back to TCP.
Run the following commands to load the smc and smc_diag kernel modules:
modprobe smc
modprobe smc_diag
You can run the dmesg command to display kernel-related messages. If the kernel modules are loaded, the following information is displayed:
smc: smc: load SMC module with reserve_mode
NET: Registered protocol family 43
smc: netns <netns ID> reserved ports [65500 ~ 65515] for eRDMA OOB
smc: adding ib device erdma_0 with port count 1
smc: ib device erdma_0 port 1 has pnetid
Note: In kernel version 5.10.134-015 and later, because SMC-R is used together with eRDMA, loading the SMC modules reserves 16 socket ports, 65500 through 65515, in each net namespace that can access ERIs. The ports are used to create out-of-band (OOB) RDMA connections. If the ports cannot be reserved, the SMC modules are still loaded, but ERI eRDMA devices cannot be used. When the SMC modules are unloaded, the reserved ports are released.
You can run the following command to view the kernel version:
uname -r
If ERI eRDMA devices cannot be used because the ports are unavailable when the SMC modules are loaded, the following information is displayed:
smc: smc: load SMC module with reserve_mode
NET: Registered protocol family 43
warning: smc: netns <netns ID> reserved ports <The ports that cannot be used.> FAIL for eRDMA OOB
You can run the following commands to unload the SMC modules:
rmmod smc_diag
rmmod smc
When the SMC modules are unloaded, the following information is displayed:
NET: Unregistered protocol family 43
smc: removing ib device erdma_0
smc: netns <netns ID> released ports [65500 ~ 65515] used by eRDMA OOB
Run the following command to install smc-tools:
yum install -y smc-tools
(Optional) Specify the default sndbuf and RMB sizes.
Each SMC-R stack locally allocates the SMC-R socket a ring-shaped sndbuf that is used to cache data to be sent and a ring-shaped RMB that is used to cache data to be received. For more information, see the Architecture section of this topic. Unlike TCP send and receive buffers, SMC-R sndbufs and RMBs can only be 16 KB to 512 KB in size.
To maximize network acceleration based on SMC-R, you can use the following methods to change the default sndbuf and RMB sizes of SMC-R sockets for throughput-intensive network models.
Alibaba Cloud Linux 3 provides the sysctl net.smc.wmem and sysctl net.smc.rmem commands to configure the default sndbuf and RMB sizes for subsequent SMC-R sockets in the current net namespace. The default sndbuf and RMB sizes of existing SMC-R sockets are not affected.
sysctl net.smc.wmem=<sndbuf size, in bytes>
sysctl net.smc.rmem=<RMB size, in bytes>
The initial sysctl net.smc.wmem and sysctl net.smc.rmem values are 256 KB.
An application can also configure the SO_SNDBUF and SO_RCVBUF options by calling setsockopt() to change the sizes of the sndbuf and RMB that are used by an SMC-R socket.
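The per-socket approach can be sketched as follows. The example uses a plain AF_INET stream socket, on the assumption that transparent replacement turns it into an SMC socket at run time. The helper names clamp_smc_buf and request_buffers are hypothetical, and the 16 KB and 512 KB bounds are the range stated in this topic. Setting the options before connect() or listen() is also an assumption, based on the statement that buffers are allocated when a connection is established.

```python
import socket

MIN_BUF = 16 * 1024    # lower bound of the sndbuf and RMB range stated in this topic
MAX_BUF = 512 * 1024   # upper bound of the sndbuf and RMB range stated in this topic


def clamp_smc_buf(size):
    """Clamp a requested buffer size to the 16 KB to 512 KB range."""
    return max(MIN_BUF, min(MAX_BUF, size))


def request_buffers(sock, sndbuf_bytes, rcvbuf_bytes):
    """Request send and receive buffer sizes and return what the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, clamp_smc_buf(sndbuf_bytes))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, clamp_smc_buf(rcvbuf_bytes))
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        print(request_buffers(sock, 512 * 1024, 512 * 1024))
```

The values that getsockopt() reports can differ from the requested ones because the kernel may adjust them, so treat the returned pair as informational.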
Run TCP socket applications over SMC stacks.
Alibaba Cloud Linux 3 allows SMC to transparently replace TCP at the net namespace level or process level.
Net namespace-level transparent replacement
Alibaba Cloud Linux 3 provides the net namespace-level transparent replacement feature that allows you to run the sysctl net.smc.tcp2smc command to replace TCP sockets with SMC sockets in a net namespace. The TCP sockets must meet the following conditions:
The family value is AF_INET.
The type value is SOCK_STREAM.
The protocol value is IPPROTO_IP or IPPROTO_TCP.
The following figure shows the replacement procedure.
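The three conditions above can be expressed as a small predicate over the arguments of a socket() call. This is a sketch for illustration, not the kernel's or smc-tools' implementation. The function name is_replaceable is hypothetical, and masking the type with 0xF to ignore flags such as SOCK_NONBLOCK is an assumption about how the check treats those flags.

```python
import socket

SOCK_TYPE_MASK = 0xF  # assumption: the low bits carry the socket type, flags sit above them


def is_replaceable(family, sock_type, protocol):
    """Return True if a socket() call matches the three replacement conditions."""
    return (
        family == socket.AF_INET
        and (sock_type & SOCK_TYPE_MASK) == socket.SOCK_STREAM
        and protocol in (socket.IPPROTO_IP, socket.IPPROTO_TCP)
    )


if __name__ == "__main__":
    print(is_replaceable(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP))
    print(is_replaceable(socket.AF_INET6, socket.SOCK_STREAM, socket.IPPROTO_TCP))
    print(is_replaceable(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP))
```

Sockets that fail the predicate, such as UDP or IPv6 sockets, are left untouched and keep using their original protocol stacks.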
To configure transparent replacement for a net namespace, perform the following steps:
Run the following command to enable transparent replacement for a net namespace:
sysctl net.smc.tcp2smc=1
By default, sysctl net.smc.tcp2smc is set to 0, which indicates that transparent replacement is disabled.
Run the following command to run TCP socket applications in the net namespace:
./foo
The TCP sockets created by the foo application are transparently replaced by SMC sockets. The network behaviors of applications are handled by SMC-R stacks. If the local and peer nodes support SMC-R and the negotiation is successful, the nodes transmit data to each other based on RDMA. If the local or peer node does not support SMC-R or the negotiation fails, the nodes securely fall back to TCP for data transmission. For more information, see the Architecture section of this topic.
Run the following command to disable transparent replacement for the net namespace:
sysctl net.smc.tcp2smc=0
Process-level transparent replacement
Alibaba Cloud Linux 3 also provides the process-level transparent replacement feature to replace TCP with SMC-R for an application. This feature requires the SMC-R monitoring and diagnostic toolkit smc-tools. For information about how to install smc-tools, see the step in this topic that installs smc-tools.
The following figure shows the replacement procedure.
When you execute the smc_run script from smc-tools to run applications, the smc_run script uses the LD_PRELOAD environment variable to set libsmc-preload.so in smc-tools as the dynamic library that must be loaded first. libsmc-preload.so replaces the TCP sockets in an application and in the child processes of the application with SMC sockets. The TCP sockets must meet the following conditions:
The family value is AF_INET.
The type value is SOCK_STREAM.
The protocol value is IPPROTO_IP or IPPROTO_TCP.
Run the following command to run foo over SMC-R stacks:
smc_run ./foo
The TCP sockets created by the foo application are transparently replaced by SMC sockets. The network behaviors of applications are handled by SMC-R stacks. If the local and peer nodes support SMC-R and the negotiation is successful, the nodes transmit data to each other based on RDMA. If the local or peer node does not support SMC-R or the negotiation fails, the nodes securely fall back to TCP for data transmission. For more information, see the Architecture section of this topic.
Track and diagnose SMC-R connections and RDMA resources.
You can use smc-tools to track and diagnose multiple aspects of SMC-R. smc-tools consists of the following tools:
smcr: shows statistics about SMC-R resources.
smcss: shows information about SMC-R sockets.
Use smcr.
smcr is used to show information about the RDMA devices and links that are used by SMC-R.
Sample smcr commands:
Run the following command to view the man page:
man smcr
Run the following command to view the available SMC devices:
smcr device
Sample command output:
Net-Dev  IB-Dev   IB-P  IB-State  Type    Crit  #Links  PNET-ID
eth0     erdma_0  1     ACTIVE    0x107f  No    0
Run the following command to view information about the RDMA links used by SMC-R:
smcr l
Sample command output:
LG-ID     LG-Role  LG-Type  VLAN  #Conns  PNET-ID
00000100  CLNT     SINGLE   0     32
00000200  CLNT     SINGLE   0     32
00000300  CLNT     SINGLE   0     32
00000400  CLNT     SINGLE   0     32
00000500  CLNT     SINGLE   0     32
00000600  CLNT     SINGLE   0     32
00000700  CLNT     SINGLE   0     8
The preceding sample command output indicates that seven RDMA links are established in the SMC-R stack on the client. Each of the first six links carries 32 connections, and the last link carries 8 connections.
Run the following command to view relevant statistics, including connection statistics, fallback statistics, statistics about sent data, statistics about received data, and memory usage statistics:
smcr -dd stats
Sample command output:
SMC-R Connections Summary
  Total connections handled           509
  SMC connections                     509 (client 0, server 509)
    v1                                509
    v2                                  0
  Handshake errors                      0 (client 0, server 0)
  Avg requests per SMC conn     1603405.0
  TCP fallback                          0 (client 0, server 0)

RX Stats
  Data transmitted (Bytes)    17954924988 (17.95G)
  Total requests                408066678
  Buffer full                           0 (0.00%)
  Buffer downgrades                     0
  Buffer reuses                       308
            8KB    16KB    32KB    64KB   128KB   256KB   512KB  >512KB
  Bufs        0       0       0       0       0     509       0       0
  Reqs   408.1M       0       0       0       0       0       0       0

TX Stats
  Data transmitted (Bytes)    70595498981 (70.60G)
  Total requests                408066477
  Buffer full                           0 (0.00%)
  Buffer full (remote)                  0 (0.00%)
  Buffer too small                      0 (0.00%)
  Buffer too small (remote)             0 (0.00%)
  Buffer downgrades                     0
  Buffer reuses                       308
            8KB    16KB    32KB    64KB   128KB   256KB   512KB  >512KB
  Bufs        0       0       0       0     509       0       0       0
  Reqs   408.1M       0       0       0       0       0       0       0

Extras
  Special socket calls                508
    cork                                0
    nodelay                           508
    sendpage                            0
    splice                              0
    urgent data                         0
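A quick way to read these statistics is to derive averages from the raw counters. The following sketch uses only the numbers in the sample output. The helper names are hypothetical, and the observation that "Avg requests per SMC conn" equals the RX and TX request totals divided by the connection count is inferred from this sample, not taken from the smc-tools documentation.

```python
def avg_requests_per_conn(rx_requests, tx_requests, connections):
    """Average number of RX plus TX requests per connection."""
    return (rx_requests + tx_requests) / connections


def avg_bytes_per_request(total_bytes, requests):
    """Average payload size of one request, in bytes."""
    return total_bytes / requests


# Counters copied from the sample output above.
RX_BYTES, RX_REQS = 17954924988, 408066678
TX_BYTES, TX_REQS = 70595498981, 408066477
CONNECTIONS = 509

if __name__ == "__main__":
    print(round(avg_requests_per_conn(RX_REQS, TX_REQS, CONNECTIONS), 1))
    print(round(avg_bytes_per_request(RX_BYTES, RX_REQS), 1))
    print(round(avg_bytes_per_request(TX_BYTES, TX_REQS), 1))
```

For this sample, the averages work out to about 44 bytes per received request and about 173 bytes per sent request, which is consistent with all requests falling into the smallest (8KB) size bucket.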
Use smcss.
smcss is used to show information about SMC-R sockets.
Sample smcss commands:
Run the following command to view the man page:
man smcss
Run the following command to view the details of all SMC-R sockets:
smcss -R
Sample command output:
State   UID    Inode    Local Address        Peer Address      Intf  Mode  Role  IB-device  Port  Linkid  GID                                      Peer-GID
ACTIVE  00000  1894397  172.16.14.xxx:45346  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1898550  172.16.14.xxx:45354  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1894399  172.16.14.xxx:45362  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1898552  172.16.14.xxx:45378  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1898554  172.16.14.xxx:45392  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1895027  172.16.14.xxx:45400  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1897069  172.16.14.xxx:45412  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
ACTIVE  00000  1895852  172.16.14.xxx:45426  172.16.14.xxx:80  0000  SMCR  CLNT  erdma_0    01    01      0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
Run the following command to view the details of all SMC sockets, including the SMC sockets that securely fall back to TCP sockets due to negotiation failures:
smcss -a
Sample command output:
State   UID    Inode    Local Address        Peer Address      Intf  Mode
ACTIVE  00000  1903782  172.16.14.xxx:42232  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1898075  172.16.14.xxx:42236  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1900819  172.16.14.xxx:42242  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1900821  172.16.14.xxx:42244  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1898077  172.16.14.xxx:42260  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1902717  172.16.14.xxx:42270  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1893237  172.16.14.xxx:42276  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1902719  172.16.14.xxx:42292  172.16.14.xxx:80  0000  TCP 0x03010000
For each connection that falls back to TCP, TCP and a cause code are displayed in the Mode column. For example, in the preceding command output, the cause code 0x03010000 is displayed in the Mode column. For information about cause codes and solutions for SMC-to-TCP fallbacks, see SMC falls back to TCP and RDMA cannot be used to accelerate communications.
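When many sockets are listed, it can help to tally the fallback cause codes instead of reading them row by row. The following sketch parses text in the layout of the preceding smcss -a output. The function name count_fallbacks is hypothetical, and the parser assumes that a fallback row contains the token TCP immediately followed by a hexadecimal cause code, as in the sample.

```python
from collections import Counter


def count_fallbacks(smcss_output):
    """Count fallback cause codes in text laid out like `smcss -a` output."""
    causes = Counter()
    for line in smcss_output.splitlines():
        fields = line.split()
        # A fallback row has "TCP" in the Mode column, followed by the cause code.
        if "TCP" in fields:
            idx = fields.index("TCP")
            if idx + 1 < len(fields) and fields[idx + 1].lower().startswith("0x"):
                causes[fields[idx + 1]] += 1
    return causes


SAMPLE = """\
State   UID    Inode    Local Address        Peer Address      Intf  Mode
ACTIVE  00000  1903782  172.16.14.xxx:42232  172.16.14.xxx:80  0000  TCP 0x03010000
ACTIVE  00000  1898075  172.16.14.xxx:42236  172.16.14.xxx:80  0000  TCP 0x03010000
"""

if __name__ == "__main__":
    for code, count in count_fallbacks(SAMPLE).items():
        print(code, count)
```

On a live system, the same function can be fed with the standard output of smcss -a, for example by capturing it with subprocess.run(["smcss", "-a"], capture_output=True, text=True).stdout.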
References
For information about how to resolve SMC issues, such as communication failures and unusable ports, see SMC issues.