Alibaba Cloud Linux: Use SMC

Last Updated: Mar 21, 2024

Alibaba Cloud Linux 3 provides Shared Memory Communication (SMC), a high-performance network protocol that functions in kernel space. SMC utilizes shared memory technology and works with socket interfaces to establish network communications. SMC is classified into the following types based on shared memory technology: Shared Memory Communications - Direct Memory Access (SMC-D) and Shared Memory Communications over Remote Direct Memory Access (SMC-R). SMC-D uses internal shared memory (ISM) technology, and SMC-R uses remote direct memory access (RDMA) technology. This topic describes SMC-R and how to use it.

Background information

IBM contributed SMC-R to the Linux kernel in version 4.11 in 2017 and has maintained it since. For more information about SMC-R, see RFC 7609. Alibaba Cloud Linux 3 leverages Alibaba Cloud Elastic RDMA (eRDMA) to make SMC-R usable in the cloud. SMC-R can transparently replace TCP in applications without loss of functionality and gives all users access to high-performance networks that are co-designed across hardware and software.

The shared memory-based data exchange model of SMC-R relies on the atomic memory operations provided by RDMA. RDMA implements protocol stacks in RDMA network interface cards (RNICs) to allow network nodes to bypass the kernel and directly access remote memory. Compared with traditional TCP networks, RDMA networks reduce memory-to-memory copies and consume fewer CPU resources in data transfers to provide low-latency, high-throughput communications. The following figure shows the differences between TCP/IP stacks and RDMA stacks.

image

RDMA is widely used in data-intensive and compute-intensive scenarios and is suitable for multiple fields, such as high-performance computing, machine learning, data centers, and massive storage.

In the past, RDMA was used only in some data centers together with network cards and switches and was complex to deploy. Alibaba Cloud eRDMA brings RDMA to the cloud. This allows all Elastic Compute Service (ECS) users to use RDMA to transmit data without the need to make complex configurations for the underlying physical network environment, such as network cards and switches.

RDMA relies on InfiniBand (IB) verbs interfaces, which are significantly different from traditional POSIX socket interfaces. Existing socket applications must be substantially rewritten before they can be migrated to RDMA networks, and applying RDMA requires a high level of technical expertise.

To utilize eRDMA and deliver higher network performance, Alibaba Cloud Linux 3 provides optimized SMC-R. Optimized SMC-R utilizes RDMA in an efficient manner and is compatible with standard TCP applications. This helps improve the performance of more applications without modifications.

Benefits

SMC-R has the following benefits:

  • High performance

    RDMA offloads protocol stacks from the kernel to network cards. This gives SMC-R lower network latency, higher throughput, and lower CPU utilization than traditional TCP stacks in specific scenarios.

    • Hardware offloading

    • Reliable and efficient direct access to remote memory

  • Transparent replacement

    SMC-R is compatible with POSIX socket interfaces and provides transparent replacement features to allow socket applications to switch from TCP stacks to SMC-R stacks without modifications or further development.

    • SMC-R can call socket interfaces to provide shared memory communications.

    • SMC-R enables multi-level transparent replacement of protocol stacks without functionality loss.

    • SMC-R provides automatic negotiation and secure fallback mechanisms.

Architecture

SMC-R architecture:

  • Protocol hierarchy and transparent replacement

    SMC-R functions in kernel space and supports the network behaviors that user-mode programs describe by using socket interfaces. SMC-R also performs RDMA transmission by using IB verbs interfaces. SMC-R stacks are responsible for using, managing, and maintaining RDMA resources. Applications are not affected by the RDMA entities in the kernel. The following figure shows the architecture of SMC-R.

    image

    Alibaba Cloud Linux 3 provides a mechanism that can be used to transparently replace TCP stacks with SMC-R stacks at the process level or net namespace level. The mechanism uses LD_PRELOAD or sysctl net.smc.tcp2smc to transparently replace AF_INET sockets with AF_SMC sockets in applications. This type of replacement enables data transmission over SMC-R stacks and improves network performance based on RDMA without the need to modify applications.

  • Automatic negotiation and secure fallback

    SMC-R provides automatic negotiation capabilities and can dynamically fall back to TCP. To establish SMC-R communications, an SMC-R stack establishes in the kernel a TCP connection to the peer node. During the handshake process, the local node uses specific TCP options to indicate its support for SMC-R and verifies that the peer node also supports SMC-R.

    • If the negotiation is successful, the SMC-R stacks on the local and peer nodes create new RDMA resources or reuse existing RDMA resources to establish a usable RDMA RC link. Data is transmitted between the nodes through the RDMA link afterward.

    • If the negotiation fails, for example because the local or peer node does not have RDMA devices, the SMC-R stacks automatically fall back to TCP stacks. The local and peer nodes use the TCP connection that is established during the negotiation to transmit data.

      Note

      SMC-R supports fallback to TCP stacks only during connection negotiation. SMC-R does not support fallback to TCP stacks during data transmission.

    The following figure shows the data flows for connection negotiation and data transmission.

    image
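
The negotiation itself happens inside the kernel, but one fallback cause named above, a node without RDMA devices, can be checked ahead of time. The following sketch assumes that RDMA devices registered with the kernel appear as entries under /sys/class/infiniband, the usual Linux sysfs location for IB devices. The function name has_rdma_device is hypothetical, and the check only inspects the local node, so a positive result does not guarantee that negotiation succeeds.

```shell
# has_rdma_device [DIR]
# Succeeds if DIR (default: /sys/class/infiniband) contains at least one entry.
has_rdma_device() {
    dir="${1:-/sys/class/infiniband}"
    [ -d "$dir" ] || return 1
    # ls -A lists all entries except . and ..; empty output means no devices.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

if has_rdma_device; then
    echo "RDMA device found: SMC-R negotiation can be attempted"
else
    echo "no RDMA device: connections are expected to fall back to TCP"
fi
```
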
  • Shared memory communications based on RDMA

    After the negotiation is complete and a connection is established, each SMC-R stack locally allocates the SMC-R socket a ring-shaped send buffer (sndbuf) that is used to cache data to be sent and a ring-shaped remote memory buffer (RMB) that is used to cache data to be received.

    • When an application on the sending node attempts to send data, the application uses socket interfaces to copy the data to the local sndbuf. Then, the SMC-R stack on the sending node performs RDMA Write operations to write the data to the RMB of the receiving node, and performs RDMA Send operations or RDMA Receive operations to send or receive Connection Data Control (CDC) messages to update and synchronize cursors in ring-shaped buffers.

    • When the SMC-R stack on the receiving node detects that data is written to the RMB, the SMC-R stack notifies the application on the receiving node, for example through epoll, to copy the data from the RMB to the user-mode buffer. The data transmission is complete when the data is copied to the user-mode buffer. In SMC-R, RMBs are used as shared memory during data transmission.

    The following figure shows the data transmission procedure.

    image
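
The cursor bookkeeping described above can be illustrated with a simplified model. This sketch assumes that the producer and consumer cursors are plain byte counters that only grow; the cursors used in the kernel are structured differently and carry wrap information, which the sketch does not reproduce. ring_state is a hypothetical helper name.

```shell
# ring_state SIZE PRODUCED CONSUMED
# Prints the used bytes, the free bytes, and the next write offset in the ring.
ring_state() {
    size=$1; produced=$2; consumed=$3
    used=$((produced - consumed))      # written by the sender, not yet copied out
    free=$((size - used))              # space the sender may still write into
    offset=$((produced % size))        # position of the next write in the ring
    echo "used=$used free=$free write_offset=$offset"
}

# A 16 KB RMB after the sender wrote 20000 bytes and the receiver copied out 8000:
ring_state 16384 20000 8000
```

In this model the sender must stop writing when free reaches 0, which is the condition that updated cursors from the receiver resolve.
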

Use scenarios

SMC-R is suitable for the following scenarios:

  • Latency-sensitive data queries and processing

    SMC-R is suitable for scenarios that involve high-performance data queries and data processing and require high network performance, such as Redis, Memcached, and PostgreSQL. SMC-R can replace TCP in applications in a transparent and non-invasive manner and allows applications to gain a 50% increase in queries per second (QPS) without further development or adaptation.

  • High-throughput data transmission

    A large amount of bandwidth and CPU resources are consumed when data is exchanged or transmitted at a large scale within a cluster. RDMA enables SMC-R to deliver the same throughput at a lower CPU load than traditional TCP stacks. This saves computing resources.

Note
  • During the handshake process of SMC-R, RDMA resources are requested and created. Therefore, SMC-R is not suitable for short-lived connection scenarios in which connections are frequently established and closed.

  • The number of connections that SMC-R supports for an ECS instance is subject to the following factors:

    • Available contiguous physical memory of the instance. By default, the sndbuf and the RMB that are used by each SMC-R socket use the contiguous physical memory that is allocated when an SMC-R connection is established. The default size of the sndbuf is the net.smc.wmem value, and the default size of the RMB is the net.smc.rmem value. You can run the following commands to view the default sizes:

      sysctl net.smc.wmem # The default size of the sndbuf used by each SMC-R socket. Unit: bytes.
      sysctl net.smc.rmem # The default size of the RMB used by each SMC-R socket. Unit: bytes.
    • Elastic RDMA Interface (ERI) eRDMA specifications. The maximum number of RDMA resources that SMC-R creates for a connection, such as Queue Pairs (QPs), Memory Registrations (MRs), Completion Queues (CQs), and Protection Domains (PDs), varies based on the ERI eRDMA specifications of the instance.

    If SMC-R cannot obtain the required resources, SMC-R securely falls back to TCP to ensure stable and reliable data transmission.
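
As a rough sizing aid, the buffer memory needed for a given number of connections can be estimated as connections × (sndbuf size + RMB size). This sketch assumes one sndbuf and one RMB per connection, as described above, and ignores buffer reuse and any other per-connection overhead. smc_buf_estimate is a hypothetical helper name.

```shell
# smc_buf_estimate CONNECTIONS WMEM_BYTES RMEM_BYTES
# Prints the estimated buffer memory in MiB (integer, rounded down).
smc_buf_estimate() {
    conns=$1; wmem=$2; rmem=$3
    total=$((conns * (wmem + rmem)))
    echo "$((total / 1024 / 1024)) MiB"
}

# 1000 connections with 256 KB buffers in each direction:
smc_buf_estimate 1000 262144 262144

# On a live instance, the current defaults could be read with sysctl -n:
# smc_buf_estimate 1000 "$(sysctl -n net.smc.wmem)" "$(sysctl -n net.smc.rmem)"
```
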

Instructions

Alibaba Cloud Linux 3 provides optimized SMC-R stacks in the kernel. The SMC-R stacks are backed by comprehensive SMC-R monitoring and diagnostic tools. To use SMC-R, perform the following steps:

  1. Create an ECS instance that supports ERI.

    SMC-R relies on RDMA. Before you use SMC-R, you must create an ECS instance that supports ERI to obtain cloud-based RDMA capabilities.

    Important

    Alibaba Cloud ERI eRDMA devices and SMC do not support IPv6 addresses. If applications use IPv6, SMC falls back to TCP.

  2. Run the following command to load the smc and smc_diag kernel modules:

    modprobe smc
    modprobe smc_diag

    You can run the dmesg command to display kernel-related messages. If the kernel modules are loaded, the following information is displayed:

    smc: smc: load SMC module with reserve_mode
    NET: Registered protocol family 43
    smc: netns <netns ID> reserved ports [65500 ~ 65515] for eRDMA OOB
    smc: adding ib device erdma_0 with port count 1
    smc:    ib device erdma_0 port 1 has pnetid
    Note

    In kernel version 5.10.134-015 and later, SMC-R is used together with eRDMA. When the SMC modules are loaded, 16 socket ports (65500 to 65515) are reserved in net namespaces that can access ERIs. The ports are used to create out-of-band (OOB) RDMA connections. If the ports cannot be reserved, the SMC modules are still loaded, but ERI eRDMA devices cannot be used. When the SMC modules are unloaded, the reserved ports are released.

    • You can run the following command to view the kernel version:

      uname -r
    • Information displayed when the SMC modules are loaded but the ports cannot be reserved, which makes ERI eRDMA devices unusable:

      smc: smc: load SMC module with reserve_mode
      NET: Registered protocol family 43
      warning: smc: netns <netns ID> reserved ports <The ports that cannot be used.> FAIL for eRDMA OOB 
    • You can run the following command to unload the SMC modules:

      rmmod smc_diag
      rmmod smc
    • Information displayed when SMC modules are unloaded:

      NET: Unregistered protocol family 43
      smc: removing ib device erdma_0
      smc: netns <netns ID> released ports [65500 ~ 65515] used by eRDMA OOB
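
The kernel messages shown in this step can also be checked from a script. The sketch below classifies kernel log text by matching the message fragments quoted above. smc_oob_status is a hypothetical helper, and the match strings assume that the message wording stays as shown in this topic.

```shell
# smc_oob_status: reads kernel log text on stdin and prints
#   reserved  - the OOB ports were reserved
#   failed    - the port reservation failed (ERI eRDMA devices cannot be used)
#   unknown   - no matching message was found
# If several messages match, the last one in the log wins.
smc_oob_status() {
    awk '
        /reserved ports/ && /FAIL for eRDMA OOB/ { state = "failed"; next }
        /reserved ports/ && /for eRDMA OOB/      { state = "reserved" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Typical use on an instance (reading the kernel log may require root):
# dmesg | smc_oob_status
```
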
  3. Run the following command to install smc-tools:

    yum install -y smc-tools
  4. (Optional) Specify the default sndbuf and RMB sizes.

    Each SMC-R stack locally allocates the SMC-R socket a ring-shaped sndbuf that is used to cache data to be sent and a ring-shaped RMB that is used to cache data to be received. For more information, see the Architecture section of this topic. Unlike the send buffers and receive buffers in TCP, the sndbufs and RMBs in SMC-R are limited to sizes from 16 KB to 512 KB.

    To maximize network acceleration based on SMC-R, you can use the following methods to change the default sndbuf and RMB sizes of SMC-R sockets for throughput-intensive network models.

    Alibaba Cloud Linux 3 provides the sysctl net.smc.wmem and sysctl net.smc.rmem commands to configure the default sndbuf and RMB sizes for subsequent SMC-R sockets in the current net namespace. The default sndbuf and RMB sizes of existing SMC-R sockets are not affected.

    sysctl net.smc.wmem=<sndbuf size, in bytes>
    sysctl net.smc.rmem=<RMB size, in bytes>

    The initial sysctl net.smc.wmem and sysctl net.smc.rmem values are 256 KB.

    The application can also configure the SO_SNDBUF and SO_RCVBUF options by using the setsockopt() call to change the sizes of the sndbuf and RMB that are used by the SMC-R socket.
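
For the sysctl method above, the values are given in bytes while buffer sizes are usually discussed in KB, so a small conversion helper can prevent mistakes. This sketch converts KB to bytes and rejects values outside the 16 KB to 512 KB range stated in this step. smc_buf_bytes is a hypothetical helper, and how the kernel treats intermediate values is not covered here.

```shell
# smc_buf_bytes SIZE_KB
# Prints SIZE_KB in bytes, or fails if the size is outside 16..512 KB.
smc_buf_bytes() {
    kb=$1
    if [ "$kb" -lt 16 ] || [ "$kb" -gt 512 ]; then
        echo "size must be between 16 and 512 KB" >&2
        return 1
    fi
    echo $((kb * 1024))
}

smc_buf_bytes 512

# Applying the result (requires root):
# sysctl net.smc.wmem="$(smc_buf_bytes 512)"
# sysctl net.smc.rmem="$(smc_buf_bytes 512)"
```
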

  5. Run TCP socket applications over SMC stacks.

    Alibaba Cloud Linux 3 allows SMC to transparently replace TCP at the net namespace level or process level.

    • Net namespace-level transparent replacement

      Alibaba Cloud Linux 3 provides the net namespace-level transparent replacement feature that allows you to run the sysctl net.smc.tcp2smc command to replace TCP sockets with SMC sockets in a net namespace. The TCP sockets must meet the following conditions:

      • The family value is AF_INET.

      • The type value is SOCK_STREAM.

      • The protocol value is IPPROTO_IP or IPPROTO_TCP.

      The following figure shows the replacement procedure.

      image

      To configure transparent replacement for a net namespace, perform the following steps:

      1. Run the following command to enable transparent replacement for a net namespace:

        sysctl net.smc.tcp2smc=1

        By default, sysctl net.smc.tcp2smc is set to 0, which indicates that transparent replacement is disabled.

      2. Run the following command to run TCP socket applications in the net namespace:

        ./foo

        The TCP sockets created by the foo application are transparently replaced by SMC sockets. The network behaviors of applications are handled by SMC-R stacks. If the local and peer nodes support SMC-R and the negotiation is successful, the nodes transmit data to each other based on RDMA. If the local or peer node does not support SMC-R or the negotiation fails, the nodes securely fall back to TCP for data transmission. For more information, see the Architecture section of this topic.

      3. Run the following command to disable transparent replacement for the net namespace:

        sysctl net.smc.tcp2smc=0
    • Process-level transparent replacement

      Alibaba Cloud Linux 3 also provides the process-level transparent replacement feature to replace TCP with SMC-R for an application. This feature requires the SMC-R monitoring and diagnostic toolkit smc-tools. For information about how to install smc-tools, see Step 3.

      The following figure shows the replacement procedure.

      image

      When you execute the smc_run script from smc-tools to run applications, the smc_run script uses the LD_PRELOAD environment variable to set libsmc-preload.so in smc-tools as the dynamic library that must be loaded first.

      libsmc-preload.so replaces the TCP sockets in an application and in the child processes of the application with SMC sockets. The TCP sockets must meet the following conditions:

      • The family value is AF_INET.

      • The type value is SOCK_STREAM.

      • The protocol value is IPPROTO_IP or IPPROTO_TCP.

      Run the following command to run foo over SMC-R stacks:

      smc_run ./foo

      The TCP sockets created by the foo application are transparently replaced by SMC sockets. The network behaviors of applications are handled by SMC-R stacks. If the local and peer nodes support SMC-R and the negotiation is successful, the nodes transmit data to each other based on RDMA. If the local or peer node does not support SMC-R or the negotiation fails, the nodes securely fall back to TCP for data transmission. For more information, see the Architecture section of this topic.
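
For the net namespace-level method in this step, the enable, run, and disable commands can be wrapped so that the setting is switched off again after the application exits. The sketch below is a convenience wrapper under stated assumptions and is not part of smc-tools: with_tcp2smc is a hypothetical name, and the SYSCTL variable exists only so that the sysctl calls can be printed instead of executed in a dry run. Because the setting applies to the whole net namespace, other applications that create TCP sockets while it is enabled are also affected.

```shell
# with_tcp2smc COMMAND [ARGS...]
# Enables net.smc.tcp2smc, runs COMMAND, then disables the setting again.
# Set SYSCTL="echo sysctl" to print the sysctl calls instead of running them.
with_tcp2smc() {
    ${SYSCTL:-sysctl} net.smc.tcp2smc=1 || return 1
    "$@"
    status=$?
    ${SYSCTL:-sysctl} net.smc.tcp2smc=0
    return "$status"
}

# Dry run: shows the order of operations without changing the system.
SYSCTL="echo sysctl" with_tcp2smc echo "./foo would run here"
```

On an instance, the call would be with_tcp2smc ./foo, run as a user that is allowed to change the sysctl setting.
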

  6. Track and diagnose SMC-R connections and RDMA resources.

    You can use smc-tools to track and diagnose multiple aspects of SMC-R. smc-tools consists of the following tools:

    • smcr: shows statistics about SMC-R resources.

    • smcss: shows information about SMC-R sockets.

    1. Use smcr.

      smcr is used to show information about the RDMA devices and links that are used by SMC-R.

      Sample smcr commands:

      • Run the following command to view the man page:

        man smcr
      • Run the following command to view the available SMC devices:

        smcr device

        Sample command output:

        Net-Dev         IB-Dev   IB-P  IB-State  Type          Crit  #Links  PNET-ID
        eth0            erdma_0     1    ACTIVE  0x107f          No       0
      • Run the following command to view information about the RDMA links used by SMC-R:

        smcr l

        Sample command output:

        LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
        00000100 CLNT     SINGLE      0      32
        00000200 CLNT     SINGLE      0      32
        00000300 CLNT     SINGLE      0      32
        00000400 CLNT     SINGLE      0      32
        00000500 CLNT     SINGLE      0      32
        00000600 CLNT     SINGLE      0      32
        00000700 CLNT     SINGLE      0       8

        The preceding sample command output indicates that seven RDMA links are established in the SMC-R stack on the client. Thirty-two connections are established over each of the first six links, and eight connections are established over the last link.

      • Run the following command to view relevant statistics, including connection statistics, fallback statistics, statistics about sent data, statistics about received data, and memory usage statistics:

        smcr -dd stats

        Sample command output:

        SMC-R Connections Summary
          Total connections handled           509
          SMC connections                     509 (client 0, server 509)
            v1                                509
            v2                                  0
          Handshake errors                      0 (client 0, server 0)
          Avg requests per SMC conn       1603405.0
          TCP fallback                          0 (client 0, server 0)
        
        RX Stats
          Data transmitted (Bytes)    17954924988 (17.95G)
          Total requests                408066678
          Buffer full                           0 (0.00%)
          Buffer downgrades                     0
          Buffer reuses                       308
                    8KB    16KB    32KB    64KB   128KB   256KB   512KB  >512KB
          Bufs        0       0       0       0       0     509       0       0
          Reqs   408.1M       0       0       0       0       0       0       0
        TX Stats
          Data transmitted (Bytes)    70595498981 (70.60G)
          Total requests                408066477
          Buffer full                           0 (0.00%)
          Buffer full (remote)                  0 (0.00%)
          Buffer too small                      0 (0.00%)
          Buffer too small (remote)             0 (0.00%)
          Buffer downgrades                     0
          Buffer reuses                       308
                    8KB    16KB    32KB    64KB   128KB   256KB   512KB  >512KB
          Bufs        0       0       0       0     509       0       0       0
          Reqs   408.1M       0       0       0       0       0       0       0
        
        Extras
          Special socket calls                508
            cork                                0
            nodelay                           508
            sendpage                            0
            splice                              0
            urgent data                         0
    2. Use smcss.

      smcss is used to show information about SMC-R sockets.

      Sample smcss commands:

      • Run the following command to view the man page:

        man smcss
      • Run the following command to view the details of all SMC-R sockets:

        smcss -R

        Sample command output:

        State          UID   Inode   Local Address           Peer Address            Intf Mode Role IB-device       Port Linkid GID                                      Peer-GID
        ACTIVE         00000 1894397 172.16.14.xxx:45346     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1898550 172.16.14.xxx:45354     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1894399 172.16.14.xxx:45362     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1898552 172.16.14.xxx:45378     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1898554 172.16.14.xxx:45392     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1895027 172.16.14.xxx:45400     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1897069 172.16.14.xxx:45412     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
        ACTIVE         00000 1895852 172.16.14.xxx:45426     172.16.14.xxx:80        0000 SMCR CLNT erdma_0         01   01     0000:0000:0000:0000:0000:ffff:ac10:xxxx  0000:0000:0000:0000:0000:ffff:ac10:xxxx
      • Run the following command to view the details of all SMC sockets, including the SMC sockets that securely fall back to TCP sockets due to negotiation failures:

        smcss -a

        Sample command output:

        State          UID   Inode   Local Address           Peer Address            Intf Mode
        ACTIVE         00000 1903782 172.16.14.xxx:42232     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1898075 172.16.14.xxx:42236     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1900819 172.16.14.xxx:42242     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1900821 172.16.14.xxx:42244     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1898077 172.16.14.xxx:42260     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1902717 172.16.14.xxx:42270     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1893237 172.16.14.xxx:42276     172.16.14.xxx:80        0000 TCP 0x03010000
        ACTIVE         00000 1902719 172.16.14.xxx:42292     172.16.14.xxx:80        0000 TCP 0x03010000

        For each connection that falls back to TCP, TCP and a cause code are displayed in the Mode column. For example, in the preceding command output, the cause code 0x03010000 is displayed in the Mode column. For information about cause codes and solutions for SMC-to-TCP fallbacks, see SMC falls back to TCP and RDMA cannot be used to accelerate communications.
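
When many connections are listed, the fallback cause codes can be summarized with a short filter. This sketch assumes the column layout of the sample smcss -a output above, in which a fallback row contains TCP followed by a hexadecimal cause code. smc_fallback_summary is a hypothetical helper name.

```shell
# smc_fallback_summary: reads `smcss -a` output on stdin and prints
# "<cause code> <count>" for each fallback cause code that appears.
smc_fallback_summary() {
    awk '
        # A fallback row shows TCP in the Mode column, followed by a cause code.
        {
            for (i = 1; i < NF; i++) {
                if ($i == "TCP" && $(i + 1) ~ /^0x[0-9a-fA-F]+$/) {
                    count[$(i + 1)]++
                }
            }
        }
        END { for (code in count) print code, count[code] }
    ' | sort
}

# Typical use on an instance:
# smcss -a | smc_fallback_summary
```
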

References

For information about how to resolve SMC issues, such as communication failures and unusable ports, see SMC issues.