Container Service for Kubernetes: Run applications that use gRPC and RDMA verbs on eRDMA nodes in an ACK cluster

Last Updated: Apr 30, 2025

You can run applications that use gRPC and Remote Direct Memory Access (RDMA) verbs on Elastic Remote Direct Memory Access (eRDMA) nodes so that the applications transfer data through RDMA instead of gRPC over TCP. This reduces the communication latency between the parameter server and the worker nodes and accelerates distributed training.

Prerequisites

Procedure

The following procedure uses a tf_cnn_benchmarks job as an example.

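  1. Submit a TensorFlow job that uses eRDMA. In the following command, --device=aliyun/erdma=1 requests an eRDMA device for the job, and --server_protocol=grpc+verbs configures TensorFlow to use RDMA verbs in addition to gRPC.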

    arena submit tfjob --name=tf-ps-benchmark \
    --gpus=8 --workers=1 --ps=1 \
    --device=aliyun/erdma=1 \
    --hostNetwork=true \
    --psImage=registry.cn-beijing.aliyuncs.com/acs/tf-benchmark:1.0 \
    --image=registry.cn-beijing.aliyuncs.com/acs/tf-benchmark:1.0 \
    	"CUDA_VISIBLE_DEVICES= python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
            --server_protocol=grpc+verbs \
            --model=resnet50 \
            --batch_size=16 \
            --data_format=NHWC"
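  2. Query the eRDMA interfaces. In the following sample output, the port of rocep156s0 is in the PORT_DOWN state and the port of rocep26s0 is in the PORT_ACTIVE state.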

    $ ibv_devinfo
    hca_id:	rocep156s0
    	transport:			eRDMA
    	fw_ver:				0.2.0
    	node_guid:			0216:3eff:fe2c:b8f3
    	sys_image_guid:			0216:3eff:fe2c:b8f3
    	vendor_id:			0x1ded
    	vendor_part_id:			4223
    	hw_ver:				0x0
    	phys_port_cnt:			1
    		port:	1
    			state:			PORT_DOWN (1)
    			max_mtu:		1024 (3)
    			active_mtu:		1024 (3)
    			sm_lid:			0
    			port_lid:		0
    			port_lmc:		0x00
    			link_layer:		Ethernet
    
    hca_id:	rocep26s0
    	transport:			eRDMA
    	fw_ver:				0.2.0
    	node_guid:			0216:3eff:fe10:f8b0
    	sys_image_guid:			0216:3eff:fe10:f8b0
    	vendor_id:			0x1ded
    	vendor_part_id:			4223
    	hw_ver:				0x0
    	phys_port_cnt:			1
    		port:	1
    			state:			PORT_ACTIVE (4)
    			max_mtu:		1024 (3)
    			active_mtu:		1024 (3)
    			sm_lid:			0
    			port_lid:		0
    			port_lmc:		0x00
    			link_layer:		Ethernet
  3. Monitor the eRDMA traffic of the active interface. The -d option specifies the device to monitor.

    $ eadm stat -d rocep26s0 -l
    Monitoring rocep26s0...    (press CTRL-C to stop)
    
     15:59:56  rx:           0 B/s     0 p/s          tx:           0 B/s     0 p/s
    
    
     rocep26s0  /  traffic statistics
    
                               rx         |       tx
    --------------------------------------+------------------
      bytes                    11.06 GiB  |       11.18 GiB
    --------------------------------------+------------------
              max            52.43 MiB/s  |     52.10 MiB/s
          average             4.03 MiB/s  |      4.07 MiB/s
              min                  0 B/s  |           0 B/s
    --------------------------------------+------------------
      packets                    8406769  |         8546764
    --------------------------------------+------------------
              max              38990 p/s  |       37488 p/s
          average               2988 p/s  |        3038 p/s
              min                  0 p/s  |           0 p/s
    --------------------------------------+------------------
      time                 46.88 minutes

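    The preceding output indicates that the traffic of the job flows through the eRDMA interface rocep26s0. In 46.88 minutes, 11.06 GiB was received and 11.18 GiB was sent. The reported averages match these totals: 11.06 GiB × 1,024 ÷ (46.88 × 60 s) is about 4.03 MiB/s.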
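
Steps 2 and 3 can be chained so that the device name does not have to be copied by hand. The following is a minimal sketch, not part of the product tooling: the helper name first_active_dev is hypothetical, and it assumes that the ibv_devinfo output has the layout shown in step 2 (an hca_id line for each device, followed by a state line for each port). The helper reads text from standard input, so it can be tried on a saved copy of the output without RDMA hardware.

```shell
#!/usr/bin/env bash
# first_active_dev is a hypothetical helper, not an existing tool.
# It reads ibv_devinfo-style text from stdin and prints the hca_id of
# the first device that reports a port in the PORT_ACTIVE state.
first_active_dev() {
  awk '
    $1 == "hca_id:" { dev = $2 }              # remember the device being listed
    $1 == "state:" && $2 == "PORT_ACTIVE" {   # stop at the first active port
      print dev
      exit
    }
  '
}

# Trimmed copy of the sample output from step 2.
sample='hca_id: rocep156s0
        port:   1
            state:          PORT_DOWN (1)

hca_id: rocep26s0
        port:   1
            state:          PORT_ACTIVE (4)'

printf '%s\n' "$sample" | first_active_dev
```

On an eRDMA node, the same helper could feed the monitoring command directly, for example dev=$(ibv_devinfo | first_active_dev) followed by eadm stat -d "$dev" -l. If no port is active, the helper prints nothing, so check that the variable is not empty before you use it.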