Elastic Compute Service: Monitor and check eRDMA

Last Updated: Jan 26, 2024

You can monitor and check elastic Remote Direct Memory Access (eRDMA) to identify and resolve issues at the earliest opportunity, ensure system security, and efficiently manage and optimize system resources. This topic describes several methods and tools that you can use to monitor and check eRDMA.

Prerequisites

eRDMA is installed and configured on an Elastic Compute Service (ECS) instance. For information about how to configure eRDMA, see Configure eRDMA on an enterprise-level instance.

Use CloudMonitor to monitor eRDMA

You can use Alibaba Cloud CloudMonitor to monitor the working status of eRDMA. Perform the following steps to view the CloudMonitor metrics that are supported by eRDMA:

  1. Log on to the CloudMonitor Metric console.

  2. Enter eri in the search box above the metric list to search for the CloudMonitor metrics that are supported by eRDMA.

    Note

    Alternatively, you can customize metrics based on your business requirements to process and receive reports and alerts about eRDMA monitoring data. For more information, see Custom Monitoring.

Use eadm to monitor eRDMA

eadm is an in-house, user-space management tool that is automatically installed together with the eRDMA driver on an ECS instance. It provides diagnostics and real-time monitoring capabilities to help identify faults. eadm provides the following features:

  • Device-wide real-time traffic statistics for traffic monitoring and assisted diagnostics.

  • Configuration management and queries, such as enabling the debugging feature and configuring congestion control (CC) algorithms.

The following section describes several common eadm commands. For information about other eadm commands, run the eadm -h command to view the command help.

Warning

eadm is used only for diagnostics and debugging purposes and is subject to changes. eadm may not be suitable for all scenarios.

  • List the supported primary commands.

    eadm -h
  • Retrieve real-time traffic information about an eRDMA device.

    eadm stat -d <ibdev_name> -l

    <ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name. If only one eRDMA device is available in your environment, you can omit the -d <ibdev_name> parameter.

  • Retrieve statistics about an eRDMA device, such as the number of CM and verbs messages and traffic volumes.

    eadm stat -d <ibdev_name>

    <ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name. If only one eRDMA device is available in your environment, you can omit the -d <ibdev_name> parameter.

  • Retrieve the version information of the current eRDMA driver.

    eadm ver
Note

Limits apply when you run other eadm commands such as info, dump, and conf. We recommend that you do not use other eadm commands.
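
As a minimal sketch of how the eadm stat invocations above could be scripted, the following Python helper builds the argument list with or without the -d option. The function names (eadm_stat_cmd, run_eadm_stat) and the device name erdma_0 are hypothetical and used only for illustration; only the stat subcommand and the -d and -l options come from this topic. The output format of eadm is not assumed, so the raw text is returned unchanged.

```python
import shutil
import subprocess


def eadm_stat_cmd(device=None, live=False):
    """Build the argument list for `eadm stat`.

    device: eRDMA device name as reported by ibv_devinfo. None omits -d,
            which this topic allows when only one eRDMA device exists.
    live:   True adds -l to request real-time traffic information.
    """
    cmd = ["eadm", "stat"]
    if device:
        cmd += ["-d", device]
    if live:
        cmd.append("-l")
    return cmd


def run_eadm_stat(device=None):
    """Run a one-shot `eadm stat` and return its raw text output.

    The -l form is not captured here because it may keep refreshing
    until interrupted (an assumption, not confirmed by this topic).
    """
    if shutil.which("eadm") is None:
        raise RuntimeError("eadm not found; check that the eRDMA driver is installed")
    done = subprocess.run(
        eadm_stat_cmd(device), capture_output=True, text=True, check=True
    )
    return done.stdout


# "erdma_0" is a hypothetical device name.
print(" ".join(eadm_stat_cmd("erdma_0", live=True)))  # eadm stat -d erdma_0 -l
```

Building the command as a list rather than a single string avoids shell quoting issues when the device name comes from another tool.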

Use iproute2 to monitor eRDMA

iproute2 is a toolkit for TCP/IP networking and traffic control. iproute2 is pre-installed in recent eRDMA versions and provides RDMA commands that can be used to monitor and check RDMA subsystems.

Note

The simple and well-structured commands in iproute2 replace the commands in net-tools, such as ifconfig, arp, route, and netstat. You can use iproute2 to manage network interfaces, route tables, and traffic. This allows administrators to quickly identify and troubleshoot network connectivity issues.

  • Query statistics about eRDMA devices, such as the number of CM and verbs messages and traffic volumes.

    rdma -p stat
  • Query the resource usage of eRDMA devices.

    rdma res
  • Query the status of eRDMA devices.

    rdma link
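
The three rdma queries above can be collected in one pass, for example to attach to a support request. The following is a sketch under stated assumptions: the labels, function names, and the injectable runner and which parameters are choices made for this example, and the command output is kept as raw text because its layout depends on the iproute2 version.

```python
import shutil
import subprocess

# The three queries listed in this section, keyed by a label chosen for this sketch.
RDMA_QUERIES = {
    "stat": ["rdma", "-p", "stat"],
    "res": ["rdma", "res"],
    "link": ["rdma", "link"],
}


def run_command(cmd):
    """Run one command and return its stdout, or an error note on failure."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    if done.returncode != 0:
        return "ERROR (exit %d): %s" % (done.returncode, done.stderr.strip())
    return done.stdout


def collect_rdma_snapshot(runner=run_command, which=shutil.which):
    """Return {label: output} for every query, or {} if rdma is not installed."""
    if which("rdma") is None:
        return {}
    return {label: runner(cmd) for label, cmd in RDMA_QUERIES.items()}
```

On an instance, calling collect_rdma_snapshot() with no arguments runs the real commands. The runner and which parameters exist only so that the function can be exercised without the rdma tool.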

Use the diagnose tool to check eRDMA

You can use the diagnose tool to check basic eRDMA functionality, eRDMA high-performance computing (HPC) environments, and basic eRDMA latencies. This helps you effectively use eRDMA.

  1. Run the following commands to obtain the diagnose tool:

    wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
    # View how to use the diagnose tool.
    python diagnose.py -h
  2. Perform a check on eRDMA.

    Basic functionality check

    Run one of the following commands to check the basic functionality of eRDMA:

    python diagnose.py -d

    Or

    python diagnose.py --diagnose

    One of the following results is returned for each check item:

    • PASS: The check item passed the check.

    • SKIP: The check cannot be performed in the current environment and is skipped.

    • FAIL: The required check tool is not installed or the check item failed the check. You can run the commands that are listed in the fail info section of the output to check the FAIL items and troubleshoot issues.

    • Other information (INFO): eRDMA-related configuration information, such as the installation mode, driver versions, and CC algorithms.

    [Image: sample output of the basic functionality check]

    In normal cases, all check items pass the check, as in the following command output.

    [Image: command output in which all check items pass the check]

    The basic functionality check on eRDMA covers 16 check items. The following list describes each check item, its expected result, and what to do if the item fails the check.

    erdma device
      Description: Whether eRDMA devices exist.
      Expected result: PASS
      Error result and solution:
        • FAIL: The eRDMA feature may not have been enabled, or an eRDMA interface (ERI) may not have been added as a secondary network interface controller (NIC), during instance creation. Enable eRDMA or add an ERI as a secondary NIC. For more information, see Configure eRDMA on an enterprise-level instance.

    erdma installed
      Description: Whether eRDMA drivers are properly installed.
      Expected result: PASS
      Error result and solution:
        • FAIL: eRDMA drivers are not properly installed. Review the driver installation steps that you performed during eRDMA configuration, or re-install the drivers. For more information, see Configure eRDMA on an enterprise-level instance.

    erdma loaded
      Description: Whether eRDMA drivers are properly loaded.
      Expected result: PASS
      Error result and solution:
        • FAIL: eRDMA drivers are not properly loaded. This issue may occur if the instance is restarted after the drivers are installed. Run the modprobe erdma command to resolve the issue.

    ibverbs loaded
      Description: Whether the ib_uverbs driver is properly loaded.
      Expected result: PASS
      Error result and solution:
        • FAIL: The ib_uverbs driver is not properly loaded. Run the modprobe ib_uverbs command to resolve the issue.

    erdma tools
      Description: Whether eRDMA-related tools are installed.
      Expected result: PASS
      Error result and solution:
        • FAIL: Run the eadm, rdma, and ibv_devinfo commands to find out which tools are missing. In most cases, eRDMA-related tools are installed together with eRDMA drivers. Review the driver installation steps that you performed during eRDMA configuration, or re-install the drivers. For more information, see Configure eRDMA on an enterprise-level instance.

    hca detected
      Description: Whether eRDMA devices are detected by the user-space driver.
      Expected result: PASS
      Error result and solution:
        • FAIL: eRDMA devices are not detected by the user-space driver. This issue occurs when the erdma device, erdma installed, erdma loaded, or ibverbs loaded check item fails the check. Check that eRDMA drivers are installed and properly loaded.

    hca active
      Description: Whether the current device is enabled.
      Expected result: PASS
      Error result and solution:
        • FAIL: The elastic network interface (ENI) of the current eRDMA device is not in the running state. This issue may occur in specific early kernel versions. Run the dhclient -v ethx command to enable the ENI (replace ethx with the name of the ENI), and then check whether the eRDMA device is in the ACTIVE state.

    erdma stats
      Description: Whether the statistics about eRDMA devices are free of errors.
      Expected result: PASS
      Error result and solution:
        • SKIP: The operating system may not support the rdma stat command.
        • FAIL: Error statistics about eRDMA devices may exist. When you ask for technical assistance, we recommend that you provide the rdma -p stat command output.

    network config
      Description: Whether the network configuration is valid.
      Expected result: PASS
      Error result and solution:
        • FAIL: If the IP addresses of multiple NICs fall within the same subnet, eRDMA may not work as expected in specific scenarios.

    erdma dmesg
      Description: Whether the kernel log is free of eRDMA-related alerts.
      Expected result: PASS
      Error result and solution:
        • FAIL: eRDMA-related alerts exist in the kernel log. Check the error details of the alerts and reload the drivers to resolve the issues.

    atomic support
      Description: Whether the eRDMA device supports RDMA Atomic Operation.
      Expected result: PASS
      Error result and solution:
        • FAIL: The current eRDMA device does not support RDMA Atomic Operation.
      Note: RDMA Atomic Operation is a feature that performs complete and consistent operations on memory at the atomic level and is suitable only for specific scenarios. If you do not need RDMA Atomic Operation, ignore the error.

    go-back-n support
      Description: Whether the eRDMA device supports the Go-back-N feature.
      Expected result: PASS
      Error result and solution:
        • SKIP: The current eRDMA device may not support queries for Go-back-N configurations.
        • FAIL: The eadm tool may not be properly installed or the eRDMA device may not support the Go-back-N feature.
      Note: Go-back-N is an extension of eRDMA that is suitable only for specific scenarios. If you do not need the Go-back-N feature, ignore the error.

    erdma install mode
      Description: The eRDMA kernel-mode driver installation mode.
      Expected result:
        • Standard: The eRDMA kernel-mode driver is installed in standard mode and supports only RDMA Connection Manager (CM) connections.
        • Compat: The eRDMA kernel-mode driver is installed in compatible mode and supports RDMA CM connections and out-of-band (OOB) connections. The driver uses TCP ports in the range of 0x7790 to 0x779F.
      Error result and solution:
        • FAIL: The installation mode of the eRDMA kernel-mode driver is not detected. This issue may occur when the erdma loaded item fails the check. Re-install the eRDMA kernel-mode driver. For more information, see Configure eRDMA on an enterprise-level instance.

    kernel driver version
      Description: The version of the eRDMA kernel-mode driver.
      Expected result: The version number of the eRDMA kernel-mode driver. Example: 0.2.38.
      Error result and solution:
        • FAIL: The version of the eRDMA kernel-mode driver is not detected. This issue may occur when the erdma loaded or erdma tools item fails the check. Make sure that the eRDMA driver is installed and properly loaded. For more information, see Configure eRDMA on an enterprise-level instance.

    rdma-core version
      Description: The version of the eRDMA user-mode driver.
      Expected result: The version number of the eRDMA user-mode driver. Example: 44.3-1.
      Error result and solution:
        • FAIL: The version of the eRDMA user-mode driver is not detected. This issue may occur when the eRDMA user-mode driver is not properly installed. Re-install the driver. For more information, see Configure eRDMA on an enterprise-level instance.

    cc algorithm
      Description: The CC algorithm of eRDMA.
      Expected result: The CC algorithm of eRDMA. Example: cubic.
      Error result and solution:
        • FAIL: The CC algorithm of eRDMA is not detected. This issue may occur when the erdma loaded or erdma tools item fails the check. Make sure that eRDMA drivers are installed and properly loaded.
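
Two of the check items above involve simple arithmetic that can be reproduced by hand: the network config item concerns NICs whose IP addresses fall within the same subnet, and the compat installation mode uses the TCP port range 0x7790 to 0x779F. The following sketch illustrates both with the Python standard library. It is not how the diagnose tool implements its checks; the function names, NIC names, and addresses are hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_nics(nic_addrs):
    """Return pairs of NIC names whose subnets overlap.

    nic_addrs maps a NIC name to its address in CIDR form, such as "10.0.0.5/24".
    """
    nets = {
        name: ipaddress.ip_interface(addr).network
        for name, addr in nic_addrs.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def compat_mode_ports():
    """TCP ports used by the driver in compat mode (0x7790 to 0x779F inclusive)."""
    return range(0x7790, 0x779F + 1)


# Hypothetical NIC names and addresses, for illustration only.
nics = {"eth0": "192.168.0.10/24", "eth1": "192.168.0.20/24", "eth2": "192.168.1.5/24"}
print(overlapping_nics(nics))  # [('eth0', 'eth1')]

ports = compat_mode_ports()
print(ports[0], ports[-1], len(ports))  # 30608 30623 16
```

In decimal, the compat mode range is therefore TCP ports 30608 to 30623, which is the form that most firewall and security group rules expect.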

    eRDMA HPC environment check

    If you want to run HPC applications in your eRDMA environment, you may need additional dependencies and configurations. You can use the diagnose tool to check the dependencies that are required for an eRDMA HPC environment. If you do not use HPC applications, ignore this section.

    Run the following command to check the dependencies that are required for an eRDMA HPC environment:

    python diagnose.py --hpc-check

    In normal cases, the following command output is returned.

    [Image: sample output of the eRDMA HPC environment check]

    During the eRDMA HPC environment check, the following items about required dependencies are checked: the CC algorithm of eRDMA, whether Go-back-N is supported, DAPL 1.0-related items, and DAPL 2.0-related items. If you do not need a dependency, ignore the errors that are reported for it. For example, if you need only DAPL 2.0, ignore the errors that are reported about DAPL 1.0.

    The following list describes each check item, its expected result, and what to do if the item fails the check.

    cc algorithm
      Description: The CC algorithm of eRDMA.
      Expected result: The CC algorithm of eRDMA. Example: cubic.
      Error result and solution:
        • FAIL: The CC algorithm of eRDMA is not detected. This issue may occur when the eadm tool is not properly installed or does not support queries for the CC algorithm of eRDMA.

    go-back-n support
      Description: Whether the eRDMA device supports the Go-back-N feature.
      Expected result: PASS
      Error result and solution:
        • SKIP: The current eRDMA device may not support queries for Go-back-N configurations.
        • FAIL: The eadm tool may not be properly installed or the eRDMA device may not support the Go-back-N feature.
      The absence of the Go-back-N feature may affect HPC applications. If you do not need the feature, ignore the error.

    dapl1 install
      Description: Whether DAPL 1.0 is properly installed.
      Expected result: PASS
      Error result and solution:
        • FAIL: The shared libraries for DAPL 1.0 or the DAPL 1.0 configuration file does not exist. Check whether DAPL 1.0 is properly installed. If you do not need DAPL 1.0, ignore the error.

    dapl1 config
      Description: Whether eRDMA configurations are included in the DAPL 1.0 configuration file.
      Expected result: PASS
      Error result and solution:
        • FAIL: No eRDMA configurations exist in the DAPL 1.0 configuration file. Check the DAPL 1.0 configuration file and add eRDMA configurations to the file. If you do not need DAPL 1.0, ignore the error.

    dapl2 install
      Description: Whether DAPL 2.0 is properly installed.
      Expected result: PASS
      Error result and solution:
        • FAIL: The shared libraries for DAPL 2.0 or the DAPL 2.0 configuration file does not exist. Check whether DAPL 2.0 is properly installed. If you do not need DAPL 2.0, ignore the error.

    dapl2 config
      Description: Whether eRDMA configurations are included in the DAPL 2.0 configuration file.
      Expected result: PASS
      Error result and solution:
        • FAIL: No eRDMA configurations exist in the DAPL 2.0 configuration file. Check the DAPL 2.0 configuration file and add eRDMA configurations to the file. If you do not need DAPL 2.0, ignore the error.

    dapl2 test
      Description: Whether the dtest command runs as expected for DAPL 2.0.
      Expected result: PASS
      Error result and solution:
        • FAIL: The dtest command fails to run. DAPL 2.0 may not be properly installed or configured.
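
The rule that errors for unneeded dependencies can be ignored can be written down as a small filter. The following is a sketch only: it assumes that you have copied the check results by hand into a dictionary, because the exact output layout of the tool is not described in this topic, and the function name and dependency keys are hypothetical.

```python
# Check items that matter only if the corresponding dependency is needed.
OPTIONAL_ITEMS = {
    "dapl1": ("dapl1 install", "dapl1 config"),
    "dapl2": ("dapl2 install", "dapl2 config", "dapl2 test"),
    "go-back-n": ("go-back-n support",),
}


def failures_to_fix(results, needed):
    """Return the FAIL items that are not excused by an unneeded dependency.

    results: {check item: "PASS" | "SKIP" | "FAIL" | other text}, copied by
             hand from the tool output.
    needed:  set of keys from OPTIONAL_ITEMS that the workload requires.
    """
    ignorable = {
        item
        for dep, items in OPTIONAL_ITEMS.items()
        if dep not in needed
        for item in items
    }
    return sorted(
        item
        for item, status in results.items()
        if status == "FAIL" and item not in ignorable
    )


# Hypothetical results for a workload that needs only DAPL 2.0.
results = {
    "cc algorithm": "cubic",
    "go-back-n support": "PASS",
    "dapl1 install": "FAIL",
    "dapl1 config": "FAIL",
    "dapl2 install": "PASS",
    "dapl2 config": "FAIL",
    "dapl2 test": "FAIL",
}
print(failures_to_fix(results, {"dapl2"}))  # ['dapl2 config', 'dapl2 test']
```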

    eRDMA latency check

    Prerequisites

    Before you perform an eRDMA latency check, make sure that the following requirements are met:

    • eRDMA is properly installed and deployed on all nodes that you want to check. For more information, see Configure eRDMA on an enterprise-level instance.

    • Password-free SSH access is allowed between all nodes that you want to check. For more information, see Build a Hadoop environment.

    • Python paramiko dependencies are installed on all nodes that you want to check.

      Install Python paramiko dependencies

      Note

      Use one of the following sets of commands based on the instance operating system to install Python paramiko dependencies. The default Python version is Python 3. If you do not have special requirements for the Python version, we recommend that you use Python 3 to reduce configuration workload.

      • Alibaba Cloud Linux or CentOS

        # python3
        sudo python3 -m pip install --upgrade pip
        sudo python3 -m pip install paramiko 
        # python2
        # If the Python version is Python 2 and python2-pip is not installed, install python2-pip.
        sudo yum -y install python2-pip
        sudo python2 -m pip install --upgrade pip==20.3.4
        sudo python2 -m pip install paramiko 
      • Ubuntu

        # python3
        sudo python3 -m pip install --upgrade pip
        sudo python3 -m pip install paramiko
        # python2
        # If python2-pip is not installed on the current node, install python2-pip.
        sudo apt install software-properties-common
        sudo add-apt-repository universe
        sudo apt update
        sudo apt install python2
        sudo curl https://bootstrap.pypa.io/pip/2.7/get-pip.py --output get-pip.py
        sudo python2 get-pip.py
        sudo python2 -m pip install --upgrade pip==20.3.4
        sudo python2 -m pip install paramiko

    Procedure

    Run the following command to check the eRDMA latency:

    python diagnose.py --perftest --hosts <n1> <n2> --user <username> --key-file </path/to/private_key>

    Take note of the following parameters:

    • --hosts <n1> <n2>: specifies the nodes that you want to check. Separate the nodes with spaces. Replace <n1> <n2> with the private IP addresses of ERIs on the nodes.

    • --user <username>: specifies the username that is used for password-free SSH logons. Replace <username> with an actual username.

    • --key-file </path/to/private_key>: specifies the absolute path of the private key file that is used for password-free SSH logons. Replace </path/to/private_key> with the actual absolute path of a private key file.

    The check results are returned in the command output.

    [Image: sample output of the eRDMA latency check]
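
Because the latency check expects private IP addresses and an absolute key file path, it can help to validate the arguments before you start the check. The following sketch assembles the command described above after some basic validation. The function name, the example addresses, the username ecs-user, and the key path are hypothetical, and the requirement of at least two hosts is an assumption based on the <n1> <n2> placeholders.

```python
import ipaddress
import os.path
import shlex


def perftest_cmd(hosts, user, key_file, script="diagnose.py"):
    """Build the latency check command after validating its arguments."""
    if len(hosts) < 2:
        # Assumption: a latency check needs at least two nodes.
        raise ValueError("specify at least two hosts")
    for host in hosts:
        # ip_address() raises ValueError for strings that are not IP addresses.
        if not ipaddress.ip_address(host).is_private:
            raise ValueError("%s is not a private IP address" % host)
    if not os.path.isabs(key_file):
        raise ValueError("the key file path must be absolute")
    return [
        "python", script, "--perftest",
        "--hosts", *hosts,
        "--user", user,
        "--key-file", key_file,
    ]


# Hypothetical values, for illustration only.
cmd = perftest_cmd(
    ["192.168.0.10", "192.168.0.11"], "ecs-user", "/home/ecs-user/.ssh/id_rsa"
)
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run on a node where diagnose.py has been downloaded.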

FAQ

How do I query the version of the current eRDMA kernel-mode driver?

If eRDMA is installed in standard mode, the eadm tool is automatically deployed. Run the following command to query the version of the current eRDMA kernel-mode driver:

eadm ver

How do I obtain the list of eRDMA devices in the current operating system?

  • Method 1: Run the ibv_devinfo command to query the details of all eRDMA devices that are available in the current operating system.

  • Method 2: If the operating system of the instance supports the rdma dev command, run the command to query the list of eRDMA devices in the operating system.

How do I query traffic statistics about eRDMA devices?

eRDMA devices whose driver versions are 0.2.34 or later support the traffic statistics feature.

  1. Run the following command to query the driver version and confirm that it is 0.2.34 or later:

    eadm ver
  2. Query real-time traffic statistics about eRDMA devices.

  • If only one eRDMA device is available, run the following command to query the real-time traffic statistics about the device:

    eadm stat -l
  • If multiple eRDMA devices are available, run the following command to query the real-time traffic statistics about each device:

    eadm stat -d <ibdev_name> -l

    <ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name.
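
The version requirement in this FAQ can be checked programmatically once you have the version string. The following is a minimal sketch, assuming the driver version has already been read from the eadm ver output and is a plain dotted number such as 0.2.38. The function and constant names are hypothetical, and the eadm ver output itself is not parsed because its layout is not described here.

```python
MIN_TRAFFIC_STATS_VERSION = (0, 2, 34)


def parse_version(text):
    """Turn a dotted version such as "0.2.38" into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_traffic_stats(driver_version):
    """Return True if the driver version is 0.2.34 or later."""
    return parse_version(driver_version) >= MIN_TRAFFIC_STATS_VERSION


print(supports_traffic_stats("0.2.38"))  # True
print(supports_traffic_stats("0.2.9"))   # False
```

Comparing integer tuples rather than strings matters here: as plain text, "0.2.9" sorts after "0.2.34", which would give the wrong answer.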