Elastic Compute Service: Monitor and diagnose eRDMA

Last Updated: Jun 25, 2026

Use CloudMonitor, eadm, iproute2, and the diagnose tool to monitor eRDMA traffic, locate faults, and evaluate network performance.

Prerequisites

eRDMA is installed and deployed on the target ECS instance. See Enable eRDMA for an ECS instance.

Monitor eRDMA with CloudMonitor

CloudMonitor tracks eRDMA status and supports custom alerts. See Custom monitoring.

View the monitoring metrics supported by eRDMA

  1. Log on to the CloudMonitor console.

  2. In the metric list search box, enter eri to view the monitoring metrics for eRDMA.

Diagnose eRDMA with the eadm tool

eadm is a user-space tool deployed with the eRDMA driver. It provides real-time monitoring and diagnostics to help locate faults. Key features include:

  • Traffic monitoring and assisted diagnostics: Real-time traffic statistics for the entire device.

  • Query and set configurations: Configure the delayed ACK feature and the congestion control (CC) algorithm.

The following are common eadm commands. For other commands, run eadm -h.

Warning

This tool is for diagnostics and debugging only. It may change in the future, and its availability is not guaranteed in all scenarios.

  • View help for the eadm command.

    eadm -h

  • Monitor real-time traffic of an eRDMA device.

    Requires driver version 0.2.34 or later.

    eadm stat -d <ibdev_name> -l

    <ibdev_name>: the eRDMA device name. Run ibv_devinfo to find it. If only one eRDMA device exists, you can omit -d <ibdev_name>.

  • Get eRDMA device statistics, such as CM and verbs message counts and traffic counts.

    eadm stat -d <ibdev_name>

    <ibdev_name>: the eRDMA device name. Run ibv_devinfo to find it. If only one eRDMA device exists, you can omit -d <ibdev_name>.

  • Get the current eRDMA driver version.

    eadm ver
Note

Other commands such as info, dump, and conf have usage constraints. Do not run them unless necessary.
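The real-time traffic view above requires driver version 0.2.34 or later, and eadm ver reports the installed version. Comparing dotted versions as plain strings sorts them incorrectly (for example, "0.2.9" would rank above "0.2.34"), so a numeric comparison is safer. The following is a minimal sketch; the function names are illustrative, and it assumes you have already extracted a plain dotted version string, because the exact text printed by eadm ver is not shown on this page.

```python
def parse_version(text):
    """Turn a dotted version such as '0.2.34' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def version_at_least(current, minimum="0.2.34"):
    """Return True when `current` is the same as or newer than `minimum`."""
    return parse_version(current) >= parse_version(minimum)


# '0.2.37' is the example kernel driver version used elsewhere on this page.
print(version_at_least("0.2.37"))  # True
print(version_at_least("0.2.9"))   # False: 9 is lower than 34
```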

Monitor and diagnose eRDMA with iproute2

iproute2 is a Linux networking toolkit that provides utilities such as ip and ss for managing network interfaces, routing tables, and traffic control. Use its built-in rdma command to monitor and diagnose the RDMA subsystem.

Note

Most Linux distributions, such as Alibaba Cloud Linux 3 and Ubuntu 20.04 or later, include iproute2 by default. See your operating system documentation for details.

  • Query eRDMA device status.

    rdma link

  • Query eRDMA resource usage, such as the number of CQs, QPs, and MRs.

    Note

    A queue pair (QP) is the pair of send and receive work queues that forms an RDMA communication endpoint. A completion queue (CQ) reports finished work requests, and a memory region (MR) is a registered buffer that the device can access directly. A verbs opcode identifies the type of operation, such as SEND, WRITE, or READ.

    See Capabilities and specifications of eRDMA.

    rdma res

  • Query eRDMA performance statistics, such as connection counts, connection status, and packet counts.

    rdma -p stat

    image
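For scripted health checks, the text printed by rdma link and rdma res can be turned into dictionaries. The sketch below assumes the single-line layouts printed by recent iproute2 releases ("link <dev>/<port> key value ..." and "<index>: <dev>: key value ..."); older releases format rdma link differently, so treat the parsers as a starting point. The device name erdma_0, the interface name eth0, and all counts in the samples are illustrative values, not captured output.

```python
# Illustrative samples only; real device names and counts will differ.
SAMPLE_LINK = "link erdma_0/1 state ACTIVE physical_state LINK_UP netdev eth0\n"
SAMPLE_RES = "0: erdma_0: pd 2 cq 3 qp 1 cm_id 0 mr 0 ctx 1 srq 0\n"


def parse_rdma_link(text):
    """Map 'device/port' to its attributes from `rdma link` output."""
    links = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] != "link":
            continue
        # After 'link <dev>/<port>' the fields alternate key, value.
        rest = fields[2:]
        links[fields[1]] = dict(zip(rest[0::2], rest[1::2]))
    return links


def parse_rdma_res(text):
    """Map device name to integer resource counts from `rdma res` output."""
    resources = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: '<index>: <dev>: pd N cq N qp N ...'
        if len(fields) < 2 or not fields[1].endswith(":"):
            continue
        rest = fields[2:]
        resources[fields[1].rstrip(":")] = {
            key: int(value)
            for key, value in zip(rest[0::2], rest[1::2])
            if value.isdigit()
        }
    return resources


def inactive_links(links):
    """Return the names of links whose state is not ACTIVE."""
    return sorted(
        name for name, attrs in links.items() if attrs.get("state") != "ACTIVE"
    )


print(inactive_links(parse_rdma_link(SAMPLE_LINK)))  # []
print(parse_rdma_res(SAMPLE_RES)["erdma_0"]["qp"])   # 1
```

To use it on a live instance, pass in the captured command output, for example from subprocess.run(["rdma", "link"], capture_output=True, text=True).stdout.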

Diagnose and evaluate eRDMA with the diagnose tool

The diagnose tool supports basic function checks, HPC environment checks, and latency checks for eRDMA.

Possible results of a diagnose check

  • PASS: The check passed.

  • SKIP: The current system version does not support this check.

  • FAIL: The check tool is not installed or the check failed. The failed command is listed in fail info.

  • Other INFO messages: eRDMA configuration details, such as installation mode, driver version, and CC algorithm.

Install diagnose

On an eRDMA-configured instance, download the diagnose tool:

  • Download from an internal URL

    wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
  • Download from a public URL

    wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
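When the same script has to run on instances with and without access to the internal mirror, the download can try the internal URL first and fall back to the public one. The following is a minimal Python 3 sketch of that fallback; the function names and the 10-second timeout are assumptions made for illustration, and only the two URLs come from this page.

```python
import urllib.request

# The two download locations listed above: internal mirror first, then public.
DIAGNOSE_URLS = [
    "http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py",
    "https://mirrors.aliyun.com/erdma/tools/diagnose.py",
]


def http_get(url, timeout=10):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_first_available(urls, fetch=http_get):
    """Try each URL in order; return (url, body) for the first that succeeds."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError and socket timeouts derive from OSError
            errors.append("%s: %s" % (url, exc))
    raise RuntimeError("all downloads failed: " + "; ".join(errors))


def download_diagnose(path="diagnose.py"):
    """Save the tool to `path` and return the URL it was fetched from."""
    url, body = fetch_first_available(DIAGNOSE_URLS)
    with open(path, "wb") as handle:
        handle.write(body)
    return url
```

Calling download_diagnose() writes diagnose.py to the current directory. The fetch parameter exists so that the fallback logic can be exercised without network access.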

How to use the diagnose tool

python diagnose.py -h

Diagnose basic eRDMA functions

The basic functional test verifies driver installation, network connectivity, and the eRDMA kernel driver's installation mode.

eRDMA basic function diagnostic items

| Check item | Description | Expected output | Failure and solution |
| --- | --- | --- | --- |
| erdma device | Checks whether an eRDMA device exists. | PASS | FAIL: RDMA was not enabled for the primary ENI, or no secondary ENI with RDMA enabled was attached when the instance was created. See Enable eRDMA for an ECS instance. |
| erdma installed | Checks whether the eRDMA driver is correctly installed. | PASS | FAIL: The driver is not correctly installed. Verify the installation steps or reinstall the driver. See Step 2: Install the eRDMA driver for an ECS instance. |
| erdma loaded | Checks whether the eRDMA driver is correctly loaded. | PASS | FAIL: The driver is not loaded. This can occur if the host was not restarted after installation. Run modprobe erdma to resolve. |
| ibverbs loaded | Checks whether the ib_uverbs driver is correctly loaded. | PASS | FAIL: The ib_uverbs driver is not loaded. Run modprobe ib_uverbs to resolve. |
| erdma tools | Checks whether eRDMA-related tools are installed. | PASS | FAIL: Run eadm, rdma, and ibv_devinfo one at a time to identify the missing tool. These tools are installed with the eRDMA driver. Verify the installation steps or reinstall. See Step 2: Install the eRDMA driver for an ECS instance. |
| hca detected | Checks whether the user-space driver correctly detects the eRDMA device. | PASS | FAIL: One of the prerequisite checks (erdma device, erdma installed, erdma loaded, or ibverbs loaded) failed. Ensure the eRDMA driver is installed and loaded. |
| hca active | Checks whether the ENI corresponding to the eRDMA device is in a normal state. | PASS | FAIL: The ENI corresponding to the eRDMA device is not UP. This may occur in older kernel versions. Run dhclient -v ethX, where ethX is the name of that network interface, to bring the interface up, then verify that the eRDMA device is ACTIVE. See Verify the eRDMA configuration. |
| erdma stats | Checks whether the eRDMA device has error counters. | PASS | SKIP: The operating system may not support rdma stat. FAIL: Error counters may exist. Provide the output of rdma -p stat when contacting technical support. |
| network config | Checks whether network connectivity is normal. | PASS | FAIL: Multiple network interfaces have IP addresses in the same subnet, which can cause eRDMA to malfunction. |
| erdma dmesg | Checks for eRDMA-related kernel alerts. | PASS | FAIL: Kernel alerts related to eRDMA are present. Check alert details and try reloading the driver. |
| atomic support | Checks whether the eRDMA device supports RDMA atomic operations. | PASS | FAIL: The eRDMA device does not support RDMA atomic operations. Ignore this if your application does not require atomic operations. Note: An atomic operation completes as a single indivisible memory operation, which ensures integrity and consistency. This feature applies only to specific use cases. |
| go-back-n support | Checks whether the eRDMA device supports the Go-back-N feature. | PASS | SKIP: The eRDMA device may not support querying Go-back-N configuration. FAIL: The eadm tool may not be installed, or the eRDMA device may not support Go-back-N. Note: Go-back-N is an extended eRDMA feature for specific use cases. Ignore related errors if not required. |
| erdma install mode | Installation mode of the eRDMA kernel driver. | | FAIL: Installation mode not found. This may occur if the erdma loaded check failed. Reinstall the eRDMA driver. See Step 2: Install the eRDMA driver for an ECS instance. |
| kernel driver version | Version of the eRDMA kernel driver. | Current eRDMA kernel driver version, for example, 0.2.37. | FAIL: Kernel driver version not found. This may occur if the erdma loaded or erdma tools check failed. Ensure the eRDMA driver is installed and loaded. See Verify the eRDMA configuration. |
| rdma-core version | Version of the eRDMA user-space driver. | eRDMA user-space driver version, for example, 44.1-2. | FAIL: User-space driver version not found. The user-space driver may not be correctly installed. Reinstall the eRDMA driver. See Step 2: Install the eRDMA driver for an ECS instance. |
| cc algorithm | CC algorithm currently used by eRDMA. | eRDMA CC algorithm, for example, hpcc_rtt. | FAIL: CC algorithm not found. This may occur if the erdma loaded or erdma tools check failed. Ensure the eRDMA driver is installed and loaded. |
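The network config item fails when two network interfaces hold addresses in the same subnet. The condition can be reproduced by hand with the ipaddress module from the Python 3 standard library. The following is a minimal sketch of that comparison, not the diagnose tool's own logic; the function name, interface names, and addresses are illustrative.

```python
import ipaddress
from itertools import combinations


def overlapping_interfaces(addresses):
    """Return pairs of interface names whose IP networks overlap.

    `addresses` maps an interface name to its address in CIDR form,
    for example {"eth0": "192.168.1.10/24"}.
    """
    networks = {
        name: ipaddress.ip_interface(cidr).network
        for name, cidr in addresses.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


# Illustrative addresses: eth0 and eth1 share 192.168.1.0/24.
example = {
    "eth0": "192.168.1.10/24",
    "eth1": "192.168.1.11/24",
    "eth2": "10.0.0.5/16",
}
print(overlapping_interfaces(example))  # [('eth0', 'eth1')]
```

The CIDR values for each interface can be read from the output of ip -o -4 addr show.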

Procedure:

  1. Log on to the eRDMA-configured instance.

    See Connect to a Linux instance by using Workbench.

  2. Download the diagnose tool.

    • Download from an internal URL

      wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
    • Download from a public URL

      wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
  3. Check the basic functions of eRDMA:

    python diagnose.py -d

    The following is an example output. For diagnostic item descriptions, see eRDMA basic function diagnostic items.

    image
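When the erdma loaded or ibverbs loaded item fails, the state of the two kernel modules can be cross-checked against /proc/modules, where the first field of each line is a module name. The following is a minimal sketch of that cross-check with illustrative function names. It assumes both drivers are built as loadable modules; a driver compiled into the kernel does not appear in /proc/modules.

```python
# Module names taken from the modprobe commands in the table above.
REQUIRED_MODULES = ("erdma", "ib_uverbs")


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """List required modules that are not currently loaded, in order."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


def read_proc_modules(path="/proc/modules"):
    """Read the kernel's list of loaded modules (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the instance, missing_modules(read_proc_modules()) returns an empty list when both modules are loaded. Any name it returns can be loaded with modprobe, as described in the table.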

Diagnose the eRDMA HPC environment

HPC applications in an eRDMA environment may require additional dependencies and configurations. The diagnose tool checks for these dependencies.

eRDMA HPC environment dependency checks

The HPC dependency check verifies the CC algorithm, Go-back-N status, and DAPL 1.0/2.0 dependencies. If you do not use certain dependencies, ignore the related errors.

| Check item | Description | Expected output | Failure and solution |
| --- | --- | --- | --- |
| cc algorithm | CC algorithm currently used by eRDMA. | eRDMA CC algorithm, for example, hpcc_rtt. | FAIL: CC algorithm not found. The eadm tool may not be correctly installed or may not support querying the CC algorithm. |
| go-back-n support | Checks whether the eRDMA device supports the Go-back-N feature. | PASS | SKIP: The eRDMA device may not support querying Go-back-N configuration. FAIL: The eadm tool may not be installed, or the eRDMA device may not support Go-back-N. This feature may affect HPC applications. Ignore if not required. |
| dapl1 install | Checks whether dapl1 is correctly installed. | PASS | FAIL: Shared libraries or configuration files for dapl1 are missing. Verify the dapl1 installation. Ignore if dapl1 is not required. |
| dapl1 config | Checks whether eRDMA is configured in the dapl1 configuration file. | PASS | FAIL: No eRDMA configuration found in the dapl1 config file. Add eRDMA configuration to the file. Ignore if dapl1 is not required. |
| dapl2 install | Checks whether dapl2 is correctly installed. | PASS | FAIL: Shared libraries or configuration files for dapl2 are missing. Verify the dapl2 installation. Ignore if dapl2 is not required. |
| dapl2 config | Checks whether eRDMA is configured in the dapl2 configuration file. | PASS | FAIL: No eRDMA configuration found in the dapl2 config file. Add eRDMA configuration to the file. Ignore if dapl2 is not required. |
| dapl2 test | Checks whether dapl2 dtest runs normally. | PASS | FAIL: dtest failed. dapl2 may not be correctly installed or configured. |

Procedure:

  1. Log on to the eRDMA-configured instance.

    See Connect to a Linux instance by using Workbench.

  2. Download the diagnose tool.

    • Download from an internal URL

      wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
    • Download from a public URL

      wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
  3. Check HPC environment dependencies:

    python diagnose.py --hpc-check

    The following is an example output. For diagnostic item descriptions, see eRDMA HPC environment dependency checks.

    image.png
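If a dapl1 config or dapl2 config item fails, it can help to list which entries in the DAPL configuration file refer to eRDMA at all. The sketch below rests on two assumptions that this page does not confirm: that the configuration is a dat.conf-style text file (commonly /etc/dat.conf) with one provider entry per line and # comments, and that eRDMA entries contain the string erdma. The function name is illustrative, and the check that the diagnose tool performs may differ.

```python
def erdma_dat_entries(dat_conf_text, needle="erdma"):
    """Return non-comment lines of a dat.conf-style file that mention `needle`.

    Assumption: '#' starts a comment and each remaining line is one entry.
    """
    entries = []
    for raw in dat_conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and needle in line.lower():
            entries.append(line)
    return entries
```

An empty result for the file's content suggests that no eRDMA entry is configured, which matches the FAIL condition described in the table.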

Evaluate eRDMA network performance

Use the perftest feature in the diagnose tool to test network performance between instances.

  • Prerequisites

    Before running the test:

    • eRDMA is installed and deployed on all test nodes. See Enable eRDMA for an ECS instance.

    • Passwordless SSH is configured among all test nodes. See Configure passwordless SSH logon.

    • Python paramiko is installed on all test nodes.

      Note
      • The diagnose tool uses paramiko for remote connections.

      • Install paramiko with the following commands. Python 3 is recommended.

      Alibaba Cloud Linux/CentOS

      # python3
      sudo python3 -m pip install --upgrade pip
      sudo python3 -m pip install paramiko 
      # python2
      # If the pip module is not installed for Python 2, install python2-pip.
      sudo yum -y install python2-pip
      sudo python2 -m pip install --upgrade pip==20.3.4
      sudo python2 -m pip install paramiko 

      Ubuntu

      # python3
      sudo python3 -m pip install --upgrade pip
      sudo python3 -m pip install paramiko
      # python2
      # If python2-pip is not installed on the current node, install it.
      sudo apt install software-properties-common
      sudo add-apt-repository universe
      sudo apt update
      sudo apt install python2
      sudo curl https://bootstrap.pypa.io/pip/2.7/get-pip.py --output get-pip.py
      sudo python2 get-pip.py
      sudo python2 -m pip install --upgrade pip==20.3.4
      sudo python2 -m pip install paramiko
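Because pip installs packages per interpreter, it is worth confirming that paramiko is importable by the same Python that will run diagnose.py. The following is a minimal Python 3 sketch with illustrative function names; it checks only that the module can be found and does not verify SSH connectivity.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    return importlib.util.find_spec(name) is not None


def missing_dependencies(names=("paramiko",)):
    """Return the subset of `names` that cannot be imported."""
    return [name for name in names if not module_available(name)]


print(missing_dependencies())  # an empty list means paramiko is installed
```

Run the check on every test node, because the diagnose tool requires paramiko on all of them.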
  • Procedure

    1. Log on to the eRDMA-configured instance.

      See Connect to a Linux instance by using Workbench.

    2. Download the diagnose tool.

      • Download from an internal URL

        wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
      • Download from a public URL

        wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
    3. Test eRDMA latency:

      python diagnose.py --perftest --hosts <n1> <n2> --user <username> --key-file </path/to/private_key>

      Parameters:

      • --hosts <n1> <n2>: Test nodes, separated by spaces. Replace <n1> <n2> with the private IP address of the eRDMA-enabled ENI on each node.

      • --user <username>: Username for passwordless SSH.

      • --key-file </path/to/private_key>: Absolute path to the private key file for SSH.

      The following is an example output of a latency test between two instances. See eRDMA network performance tests.

      Each table shows latency for different operations. Rows represent requesters, columns represent responders. Cell values show average latency in microseconds (99.9th percentile in parentheses).

      image.png
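The latency test command from step 3 can be assembled and validated before it is run, which catches a malformed IP address or a relative key path early. The following is a minimal sketch; the function name is illustrative, and the requirement of at least two hosts is an assumption based on the <n1> <n2> placeholders rather than a documented limit.

```python
import ipaddress
import posixpath


def build_perftest_command(hosts, user, key_file, python="python"):
    """Return the argv list for the latency test command shown in step 3."""
    if len(hosts) < 2:
        # Assumption: the test needs a requester and a responder.
        raise ValueError("at least two hosts are needed for a latency test")
    for host in hosts:
        ipaddress.ip_address(host)  # raises ValueError for a malformed address
    if not posixpath.isabs(key_file):
        raise ValueError("key_file must be an absolute path")
    return [
        python, "diagnose.py", "--perftest",
        "--hosts", *hosts,
        "--user", user,
        "--key-file", key_file,
    ]
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems in the key file path.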