Alibaba Cloud Linux 3 is tuned to support Shared Memory Communications over RDMA (SMC-R). SMC-R is based on Alibaba Cloud eRDMA and can transparently replace TCP in applications without loss of functionality. SMC-R enables direct, high-speed, low-latency, and memory-to-memory communications and provides higher performance than TCP in various scenarios such as in-memory databases, remote procedure calls (RPCs), and transmission of large files. This topic describes SMC-R and how to work with it.

Background information

SMC-R is based on Remote Direct Memory Access (RDMA). Before you can understand SMC-R, you must understand what RDMA and Alibaba Cloud eRDMA are.

RDMA lets a host read from and write to the memory of a remote host directly, without kernel intervention on the data path. It is widely used in data-intensive and compute-intensive scenarios such as high-performance computing, machine learning, data centers, and mass storage. RDMA underlies core Alibaba Cloud services, such as Enhanced SSD (ESSD) and PolarDB, and has been tried and tested in critical business scenarios such as Double 11.

RDMA provides zero-copy and protocol stack offload capabilities. It offloads protocol stacks to RDMA Network Interface Controllers (RNICs) and performs direct memory access without involving the kernel. By bypassing the operating system (OS) kernel stack, RDMA reduces CPU processing costs and delivers lower latency and higher throughput than traditional TCP networks. The following figure shows the differences between TCP/IP and RDMA stacks.

Figure: TCP/IP vs RDMA

Previously, RDMA was used only with NICs and switches in some data centers and was complex to deploy. Alibaba Cloud eRDMA is a service that brings RDMA to the cloud for easy use, and allows Elastic Compute Service (ECS) users to use RDMA for data transmission without concerns about complex physical network environment configurations such as NICs and switches.

However, the InfiniBand (IB) verbs interfaces used by RDMA differ greatly from common POSIX socket interfaces, so conventional applications must be substantially modified before they can work with RDMA. In addition, using IB verbs interfaces efficiently requires technical expertise.

To make full use of eRDMA and deliver higher network performance, Alibaba Cloud Linux 3 provides an optimized SMC-R implementation that supports eRDMA. SMC-R provides a standard socket interface over RDMA to applications. It uses RDMA efficiently and remains compatible with standard TCP applications, so that more applications can benefit from RDMA without modification.

Architecture

SMC-R architecture:
  • Protocol hierarchy and transparent replacement
    SMC-R is a reliable streaming transmission protocol that is fully compatible with sockets. SMC-R functions between the socket layer and the IB verbs layer in kernel space. It supports common socket interfaces and uses the IB verbs kernel-mode interface to help the RDMA driver transmit data. Alibaba Cloud Linux 3 provides a tool for the replacement of the protocol family at the socket layer to allow SMC-R to transparently replace TCP at the net namespace or application level. When you use Alibaba Cloud Linux 3, you can transition from TCP to SMC-R and achieve higher network performance based on RDMA without the need to modify network applications. The following figure shows the architecture of SMC-R.

    Figure: Transparent replacement
  • Automatic negotiation and secure fallback
    SMC-R has the automatic negotiation capability and can dynamically fall back to TCP. To establish an RDMA connection, SMC-R first establishes in the kernel a TCP connection to the remote node. During the handshake, the local node uses specific TCP options to indicate its support for SMC-R and verifies that the remote node also supports SMC-R.
    • If both the local and remote nodes are confirmed to support SMC-R during the negotiation, SMC-R allocates the RDMA resources required by user-mode network applications. These resources include the queue pair (QP) and completion queue (CQ) required to build an asynchronous RDMA communication model. At the same time, SMC-R creates a send buffer and a receive buffer and registers the receive buffer as the remote memory buffer (RMB) to which the remote node has direct access. Then, SMC-R initializes the RDMA connection. SMC-R encapsulates the access key (RKey) and start address of the RMB into an RToken and notifies the remote node of the RToken.
    • If the local or remote node is found not to support SMC-R during the negotiation, the fallback-to-TCP mechanism is triggered and the local and remote nodes use the established TCP connection to transmit data and ensure network stability and reliability.
      Note: SMC-R can fall back to the TCP stack only during connection negotiation. It cannot fall back to the TCP stack during data transmission.
    The following figure shows the network flows for negotiation and data transmission.

    Figure: Fallback-to-TCP mechanism architecture
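
The negotiation outcome and the RToken described above can be sketched in a few lines of Python. This is an illustrative model only, not the kernel implementation: the function names are hypothetical, and the 12-byte layout (a 4-byte RKey followed by an 8-byte RMB start address, big-endian) is an assumption made for the example.

```python
import struct
from enum import Enum


class Transport(Enum):
    SMC_R = "SMC-R"
    TCP = "TCP"


def negotiate(local_supports_smcr: bool, peer_supports_smcr: bool) -> Transport:
    """Pick the transport once the TCP handshake options have been exchanged."""
    if local_supports_smcr and peer_supports_smcr:
        return Transport.SMC_R
    # Fallback: keep using the TCP connection that is already established.
    return Transport.TCP


# Assumed layout for illustration: 4-byte RKey + 8-byte RMB start address.
_RTOKEN = struct.Struct(">IQ")


def pack_rtoken(rkey: int, rmb_addr: int) -> bytes:
    """Bundle the RKey and RMB start address into one opaque token."""
    return _RTOKEN.pack(rkey, rmb_addr)


def unpack_rtoken(blob: bytes) -> tuple:
    """Recover (rkey, rmb_addr) from a token received from the peer."""
    return _RTOKEN.unpack(blob)
```

The point of the sketch is that the decision is made once, during negotiation, and that the RToken carries everything the peer needs to address the RMB directly.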
  • Ring memory and data receiving and sending

    SMC-R relies on efficient RDMA networks and a ring-shaped shared memory architecture for high-performance data transmission. After network applications transmit data to the kernel, SMC-R performs RDMA Write operations to write the data to the ring RMB of the remote node and performs RDMA Send and Receive operations to send and receive Connection Data Control (CDC) messages for updating and synchronizing the cursors in RMBs.

    For the RMB on one side, the reading peer updates its consumer cursor to indicate the address of the next byte of data to be consumed. To prevent unread data from being overwritten, the writing peer does not write data to the RMB beyond the consumer cursor. Similarly, the writing peer updates its producer cursor to indicate the address of the next byte of data to be written. To ensure data correctness, the reading peer does not read data from the RMB beyond the producer cursor. The peers update and synchronize these cursors to track writes and reads, which keeps data transmission safe and reliable.

    The following figure shows the data transmission procedure.

    Figure: Data sending and receiving
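
The cursor rules above can be illustrated with a toy ring buffer. This is a simplified sketch under stated assumptions: the class name RingRMB is hypothetical, both cursors live in one local object, and no RDMA Write or CDC message is involved. In SMC-R the two cursors belong to different peers and are synchronized over the network.

```python
class RingRMB:
    """Toy model of a ring-shaped RMB with producer and consumer cursors."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.buf = bytearray(size)
        self.produced = 0  # total bytes written so far (producer cursor)
        self.consumed = 0  # total bytes read so far (consumer cursor)

    def free_space(self) -> int:
        return self.size - (self.produced - self.consumed)

    def write(self, data: bytes) -> int:
        """Write as much as fits without passing the consumer cursor."""
        n = min(len(data), self.free_space())
        for i in range(n):
            self.buf[(self.produced + i) % self.size] = data[i]
        self.produced += n
        return n

    def read(self, max_bytes: int) -> bytes:
        """Read up to max_bytes without passing the producer cursor."""
        n = min(max_bytes, self.produced - self.consumed)
        out = bytes(self.buf[(self.consumed + i) % self.size] for i in range(n))
        self.consumed += n
        return out


rmb = RingRMB(8)
accepted = rmb.write(b"abcdefghij")  # only 8 of the 10 bytes fit
first = rmb.read(5)                  # consuming 5 bytes frees 5 bytes
accepted_again = rmb.write(b"ij")    # wraps around the end of the buffer
rest = rmb.read(8)                   # returns the 5 bytes still unread
```

The writer is throttled by the consumer cursor and the reader by the producer cursor, so neither side needs a lock on the shared buffer.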

Benefits

SMC-R has the following benefits:
  • High performance
    RDMA offloads data-plane features to RNICs and bypasses the kernel to directly access remote ring receive buffers. This enables SMC-R to have lower latency, higher throughput, and smaller CPU loads than traditional TCP stacks in specific scenarios.
    • SMC-R protocol stacks are more lightweight than TCP stacks.
    • SMC-R uses RDMA for communication to lower latency and CPU loads and improve throughput.
    • SMC-R has direct access to efficient, reliable remote ring buffers.
  • Transparent replacement
    SMC-R is compatible with POSIX socket interfaces. By using sysctl and user-mode tools, you can have SMC-R transparently replace TCP for new connections at the net namespace or application level, without the cost of manually modifying or further developing applications.
    • SMC-R exploits RDMA Reliable Connection (RC) transports at the underlying layer and is compatible with socket interfaces to provide reliable streaming transmission in place of TCP.
    • SMC-R has the automatic negotiation and secure fallback-to-TCP mechanisms.
    • SMC-R can transparently replace TCP at the net namespace or application level without loss of functionality.
    • SMC-R is compatible with eRDMA Internet Wide-area RDMA Protocol (iWARP) and RDMA over Converged Ethernet (RoCE) networks at the underlying layer.

Use scenarios

SMC-R is applicable to the following scenarios:
  • Latency-sensitive data queries and processing

    SMC-R suits workloads that involve high-performance data queries and processing and that require high network performance, such as Redis, Memcached, and PostgreSQL. Applications can use SMC-R in place of TCP in a transparent and non-invasive manner and gain a 50% increase in QPS without further development or adaptation.

  • High-throughput data transmission

    Exchanging or transmitting data at scale within a cluster tends to consume large amounts of bandwidth and CPU resources. The efficient communication model used by Shared Memory Communications (SMC) enables SMC-R to deliver the same throughput at a lower CPU load than traditional TCP stacks, which saves computing resources.

Instructions

Alibaba Cloud Linux 3 provides a wide array of monitoring and maintenance tools for you to monitor the status of SMC-R and diagnose its issues. You can perform the following procedure to use SMC-R.

  1. Load the SMC-R modules.
    By default, SMC-R is compiled into kernel modules: smc and smc_diag. You can manually load these modules in the system.
    1. Run the following command to load the smc kernel module:
      modprobe smc
    2. Run the following command to load the smc_diag kernel module:
      modprobe smc_diag
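
To confirm that both modules are loaded after running the modprobe commands, you can inspect /proc/modules. The following Python sketch is one way to do that. The helper names are hypothetical, and it assumes that SMC-R is built as loadable modules as described above (built-in code does not appear in /proc/modules).

```python
REQUIRED_MODULES = ("smc", "smc_diag")


def missing_modules(proc_modules_text: str, required=REQUIRED_MODULES) -> list:
    """Return the required module names that are absent from /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


def check_host(path: str = "/proc/modules") -> list:
    """Read the live module list and report which SMC-R modules are missing."""
    with open(path, encoding="ascii") as handle:
        return missing_modules(handle.read())
```

An empty list from check_host() means that both smc and smc_diag are loaded.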
  2. Enable transparent replacement.
    Alibaba Cloud Linux 3 supports net namespace-level and application-level transparent replacement to allow SMC-R to transparently replace TCP for net namespaces or applications.
    • Net namespace-level transparent replacement
      Alibaba Cloud Linux 3 provides the net namespace-level transparent replacement feature to replace new TCP sockets with SMC-R sockets within a net namespace. The following figure shows the replacement procedure.

      Figure: Logic of net namespace-level transparent replacement

      You can perform the following operations to configure transparent replacement for a net namespace:
      1. Run the following command to enable transparent replacement for a net namespace:
        sysctl net.smc.tcp2smc=1
        By default, net.smc.tcp2smc is set to 0, which indicates that transparent replacement is disabled. When net.smc.tcp2smc is set to 1, the protocol family of new sockets that applications create changes from PF_INET/PF_INET6 to AF_SMC. This way, TCP sockets transition to SMC-R sockets.
      2. Run applications.

        If the operation in substep 1 is also performed at the peer, the local and remote nodes use SMC-R for data transmission. If the operation is not performed at the peer, the local and remote nodes fall back to TCP for data transmission. For more information about the negotiation process, see the Automatic negotiation and secure fallback section in this topic.

      3. Run the following command to disable transparent replacement for the net namespace:
        sysctl net.smc.tcp2smc=0
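
A script can check the current setting before it starts an application. The sketch below reads the value through procfs, relying on the standard mapping of the sysctl key net.smc.tcp2smc to /proc/sys/net/smc/tcp2smc. The function name is hypothetical, and the value read is the one for the net namespace in which the script runs.

```python
from pathlib import Path
from typing import Optional

TCP2SMC_PATH = Path("/proc/sys/net/smc/tcp2smc")


def tcp2smc_enabled(path: Path = TCP2SMC_PATH) -> Optional[bool]:
    """Return True or False for the current setting, or None if the knob is absent."""
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        # Typically means the smc module is not loaded or the kernel lacks the knob.
        return None
```

A return value of None is a hint to load the smc module first, and False means that new sockets in this namespace remain plain TCP sockets.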
    • Application-level transparent replacement
      Alibaba Cloud Linux 3 also provides the application-level transparent replacement feature to replace TCP with SMC-R for an application. This feature requires the SMC-R monitoring and maintenance toolkit smc-tools.
      Note You can run the yum install smc-tools -y command to install the smc-tools toolkit. For more information about the smc-tools toolkit, see Step 3.
      When you execute the smc_run script from smc-tools to run applications, the smc_run script uses the LD_PRELOAD environment variable to define libsmc-preload.so in smc-tools as the dynamic library to be loaded first. libsmc-preload.so attempts to replace the new TCP sockets established for applications with SMC-R sockets.
      smc_run command description:
      Usage: smc_run [ OPTIONS ] COMMAND
      
      Run COMMAND using SMC for TCP sockets
      For example, to use SMC-R to run the testApp application in the current directory, run the following command:
      smc_run ./testApp
      Similar to net namespace-level transparent replacement, application-level transparent replacement requires the local and remote nodes to transparently replace TCP with SMC-R by using smc_run before an SMC-R connection can be established for RDMA communication.
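
When an application is started from a script, the smc_run prefix can be added conditionally. The following Python sketch, with hypothetical helper names, wraps the command in smc_run when the tool is found on PATH and otherwise starts the application unchanged, in which case it uses ordinary TCP.

```python
import shutil
import subprocess


def build_command(app_argv, which=shutil.which):
    """Prefix app_argv with smc_run when it is on PATH, else leave it unchanged."""
    runner = which("smc_run")
    if runner is None:
        return list(app_argv)
    return [runner, *app_argv]


def launch(app_argv):
    """Start the application; SMC-R is used only if the peer negotiates it too."""
    return subprocess.Popen(build_command(app_argv))
```

For example, launch(["./testApp"]) corresponds to running smc_run ./testApp on a host where smc-tools is installed. The which parameter exists only so that the lookup can be replaced in tests.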
  3. Use the SMC-R monitoring and maintenance tools to monitor SMC-R.
    You can use smc-tools to track and diagnose SMC-R from multiple aspects. smc-tools includes the following tools:
    • smcr: shows information about SMC-R, such as information about linkgroups and devices.
    • smcss: shows information about active SMC-R sockets.
    1. Run the following command to install smc-tools:
      yum install smc-tools -y
    2. Use smcr.
      smcr is used to show information about SMC-R, such as information about linkgroups and devices.
      Command description:
      Usage: smcr  [ OPTIONS ] OBJECT {COMMAND | help}
      OBJECT : { linkgroup | device }
              linkgroup
                  Linkgroup(s) or link(s) used by SMC-R.
              device
                  One or more SMC-R devices.
      OPTIONS : {-v[ersion] | -d[etails] | -dd[etails]}
              -v, -version
                  Print the version of the smcr utility and exit.
              -d, -details
                  Print detailed information.
              -dd, -ddetails
                  Print more detailed information.
      Usage example:
      • You can run the following command to view SMC-R device information:
        smcr device
        Example command output:
        Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
        eth0            erdma_01    1    ACTIVE  0x7ffd          No       0
      • You can run the following command to view SMC-R linkgroup information:
        smcr linkgroup
        Example command output:
        LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
        00000100 CLNT     SINGLE      0       1
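
For monitoring scripts, the tabular smcr output can be turned into dictionaries. The sketch below is a minimal whitespace-based parser with a hypothetical function name. It assumes that only trailing columns (such as PNET-ID in the examples above) may be blank; a blank column in the middle of a row would shift the values.

```python
def parse_smcr_table(output: str) -> list:
    """Map each data row of `smcr device` or `smcr linkgroup` to a dict."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # zip() stops at the shorter sequence, so a blank trailing PNET-ID is omitted.
    return [dict(zip(header, line.split())) for line in lines[1:]]


sample = """\
LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
00000100 CLNT     SINGLE      0       1
"""
groups = parse_smcr_table(sample)
```

Here groups holds one entry whose LG-Role is CLNT and whose #Conns is 1, matching the example output. All values are kept as strings.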
    3. Use smcss.
      smcss shows information about SMC-R sockets.
      Command description:
      Usage: smcss [ OPTIONS ]
      OPTIONS :
             (none)
                    displays a list of connecting, closing, or connected SMC sockets with basic information.
             -a, --all
                    displays all types of SMC sockets: listening, opening, closing, and connected.
             -l, --listening
                    shows listening sockets only. These are omitted by default.
             -d, --debug
                    displays additional debug information, such as shutdown state.
             -D, --smcd
                    displays additional SMC-D specific information. Shows SMC-D sockets only.
             -h, --help
                    displays usage information.
             -R, --smcr
                    displays additional SMC-R specific information. Shows SMC-R sockets only.
             -v, --version
                    displays program version.
             -W, --wide
                    do not truncate IP addresses.
      Usage example:
      You can run the following command to view detailed information about all SMC-R sockets:
      smcss -a -R -d
      Example command output:
      State          UID   Inode   Local Address           Peer Address            Intf Mode Shutd Token    Sndbuf   Rcvbuf   Peerbuf  rxprod-Cursor rxcons-Cursor rxFlags txprod-Cursor txcons-Cursor txFlags txprep-Cursor txsent-Cursor txfin-Cursor  Role IB-device       Port Linkid GID                                      Peer-GID
      ACTIVE         00000 1105985880 192.168.XX.XX:49080    192.168.XX.XX:10003    0000 SMCR  <->  00000001 00020000 00040000 00040000 0001:00026256 0001:00026256 00:00   0001:00026264 0001:00026256 00:00   0003:00006264 0003:00006264 0003:00006264 CLNT erdma_012211    01   01     0016:3e01:2211:0000:0000:0000:0000:0000  0016:3e01:43b4:0000:0000:0000:0000:0000
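
The cursor columns in this output can be compared to estimate how much data is in flight. The following sketch rests on assumptions that this topic does not state: that each cursor is printed as wrap-count:offset, that both fields and the buffer sizes are hexadecimal, and that the wrap counter itself has not overflowed. The helper names are hypothetical.

```python
def parse_cursor(text: str) -> tuple:
    """Split a cursor such as '0001:00026256' into (wrap_count, offset)."""
    wrap, offset = text.split(":")
    return int(wrap, 16), int(offset, 16)  # assumed to be hexadecimal


def bytes_between(consumer: tuple, producer: tuple, buf_size: int) -> int:
    """Bytes produced but not yet consumed; ignores wrap-counter overflow."""
    (c_wrap, c_off), (p_wrap, p_off) = consumer, producer
    return (p_wrap - c_wrap) * buf_size + (p_off - c_off)


peerbuf = int("00040000", 16)  # Peerbuf column from the example row
pending_tx = bytes_between(
    parse_cursor("0001:00026256"),  # txcons-Cursor
    parse_cursor("0001:00026264"),  # txprod-Cursor
    peerbuf,
)
pending_rx = bytes_between(
    parse_cursor("0001:00026256"),  # rxcons-Cursor
    parse_cursor("0001:00026256"),  # rxprod-Cursor
    peerbuf,
)
```

Under these assumptions, the example row has a small amount of sent data that the peer has not yet consumed and no unread received data. A difference that keeps growing over time would suggest a slow reader.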