ConnectX-4 RoCE speed less than expected

Hey,
I’m currently working on a simple setup of two systems connected back-to-back via two ConnectX-4 Lx NICs and a fiber-optic cable with SFP28 modules on both ends. We are aiming to utilize GPUDirect RDMA, although only one system has a GDR-supported GPU (here: an RTX A5000). Currently I’m looking for advice on how to improve the RoCE throughput: we expect to get ~25 GBit/s, but so far we have only been able to achieve an average of 14.35 GBit/s (reported by ib_send_bw).

The setup looks as follows:

  • System #1:
    CPU: AMD Ryzen Threadripper PRO 3955WX 16-Cores
    GPU: NVIDIA RTX A5000
    NIC: ConnectX-4 Lx
  • System #2:
    CPU: AMD Ryzen 9 5900X 12-Core Processor
    GPU: NVIDIA GeForce GTX 1650 (UUID: GPU-1e40bc5f-5675-d381-6ed0-ec9c0b990820)
    NIC: ConnectX-4 Lx

Used fiber-optics cable: https://www.fs.com/de/products/40233.html?attribute=803&id=18479

All NICs are connected to PCIe 3.0 x16 slots (each port has x8, i.e. roughly 64 GBit/s per port), so I would expect that this is not the limiting factor. We ran the following commands:

System #1 (Server):
$ ib_send_bw -F -d mlx5_0 -a --report_gbits

System #2 (Client):
$ ib_send_bw -F -d mlx5_0 -a 20.4.3.219 --report_gbits

This is the report by ib_send_bw:

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF		  Device         : mlx5_0
 Number of qps   : 1		  Transport type : IB
 Connection type : RC		  Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 100
 Mtu             : 4096[B]  (EDIT: Use "ifconfig enpX mtu 9000" to replicate the mtu size)
 Link type       : Ethernet
 GID index       : 5
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0092 PSN 0x52d594
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:20:04:03:219
 remote address: LID 0000 QPN 0x0093 PSN 0x27fd9c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:20:04:03:220
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          1000           0.000000           0.054278           3.392360
 4          1000           0.000000           0.094776           2.961734
 8          1000           0.00               0.22               3.407039
 16         1000           0.00               0.44               3.402634
 32         1000           0.00               0.87               3.404835
 64         1000           0.00               1.76               3.434656
 128        1000           0.00               3.49               3.408432
 256        1000           0.00               6.96               3.397778
 512        1000           0.00               12.31              3.004447
 1024       1000           0.00               13.25              1.617233
 2048       1000           0.00               13.77              0.840626
 4096       1000           0.00               14.03              0.428310
 8192       1000           0.00               14.21              0.216778
 16384      1000           0.00               14.24              0.108639
 32768      1000           0.00               14.31              0.054605
 65536      1000           0.00               14.33              0.027337
 131072     1000           0.00               14.34              0.013677
 262144     1000           0.00               14.35              0.006841
 524288     1000           0.00               14.35              0.003421
 1048576    1000           0.00               14.35              0.001711
 2097152    1000           0.00               14.35              0.000855
 4194304    1000           0.00               14.35              0.000428
 8388608    1000           0.00               14.35              0.000214
---------------------------------------------------------------------------------------
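
As an aside on the MTU note above: instead of ifconfig, the same jumbo MTU can be set with iproute2 (interface name taken from the ibdev2netdev output further below), e.g.:

$ sudo ip link set dev enp5s0f0np0 mtu 9000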

Other configuration information:

$ ibv_devinfo
hca_id:	mlx5_0
	transport:			InfiniBand (0)
	fw_ver:				14.32.1010
	node_guid:			08c0:eb03:00cb:9382
	sys_image_guid:			08c0:eb03:00cb:9382
	vendor_id:			0x02c9
	vendor_part_id:			4117
	hw_ver:				0x0
	board_id:			MT_2420110034
	phys_port_cnt:			1
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		4096 (5)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
$ ibdev2netdev
mlx5_0 port 1 ==> enp5s0f0np0 (Up)
mlx5_1 port 1 ==> enp5s0f1np1 (Down)
$ ethtool enp5s0f0np0
Settings for enp5s0f0np0:
	Supported ports: [ FIBRE ]
	Supported link modes:   1000baseKX/Full 
	                        10000baseKR/Full 
	                        25000baseCR/Full 
	                        25000baseKR/Full 
	                        25000baseSR/Full 
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Supported FEC modes: None BaseR RS
	Advertised link modes:  1000baseKX/Full 
	                        10000baseKR/Full 
	                        25000baseCR/Full 
	                        25000baseKR/Full 
	                        25000baseSR/Full 
	Advertised pause frame use: Symmetric
	Advertised auto-negotiation: Yes
	Advertised FEC modes: RS
	Link partner advertised link modes:  Not reported
	Link partner advertised pause frame use: No
	Link partner advertised auto-negotiation: Yes
	Link partner advertised FEC modes: Not reported
	Speed: 25000Mb/s
	Duplex: Full
	Port: FIBRE
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
	Current message level: 0x00000004 (4)
			       link
	Link detected: yes

Is there anything else I should take care of? I appreciate any help or advice on this topic.

I was wrong - the mainboard of system #2 limits the PCIe link to x2 at PCIe 3.0 speed, which results in a theoretical maximum of roughly 16 GBit/s and matches the bandwidth reported by ib_send_bw quite closely.

See the output of lspci -vv:

		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <4us
			    ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			    ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
--->    LnkSta:	Speed 8GT/s (ok), Width x2 (downgraded)
			    TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt
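
For anyone who wants to reproduce this check, we roughly located the NIC's PCI address and then compared its link capability with the negotiated link status (the bus address 05:00.0 is just an example from our box; 8 GT/s x2 with 128b/130b encoding works out to about 15.75 GBit/s):

$ lspci | grep -i mellanox                              # find the NIC's bus address, e.g. 05:00.0
$ sudo lspci -s 05:00.0 -vv | grep -E 'LnkCap|LnkSta'   # compare capable vs. negotiated width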

I will switch the NIC to another slot and let you know whether this improves the RoCE performance.

1. SEND is not "real" RDMA; it just copies data into the NIC buffer. You should use ib_write_bw instead (see the sketch after this list).

2. Why not use -x to configure the GID index explicitly?

3. To use GDR you need to locate the HCA and the GPU under the same PCIe bridge, and disable PCIe ACS.

4. To use GDR you need the nv_peer_mem kernel driver.

5. To test RDMA performance you need to bind the perf test to a core on the NIC's local NUMA node with taskset/numactl etc.
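
Following this advice, a run on our two systems might look roughly like the following; the GID index (5) and the pinned core (2) are placeholders for our setup:

System #1 (server):
$ taskset -c 2 ib_write_bw -F -d mlx5_0 -x 5 -a --report_gbits

System #2 (client):
$ taskset -c 2 ib_write_bw -F -d mlx5_0 -x 5 -a 20.4.3.219 --report_gbits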

First: switching the CX4 to another slot now allows us to utilize the full bandwidth of the SFP28 link (25 GBit/s).

Thanks, you are right. Luckily, the perftest tools already picked the correct GID on their own (here: index 5 for RoCEv2).
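
For reference, this is how the GID table can be double-checked for mlx5_0 port 1; the second file should report "RoCE v2" for the index we use:

$ cat /sys/class/infiniband/mlx5_0/ports/1/gids/5
$ cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/5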

ACS is disabled: we had tested GPUDirect Storage (GDS) before and used gdscheck -p to verify that IOMMU and ACS are disabled (both in the BIOS), as required by GDS. This is the output of nvidia-smi topo -m for system #1:

	    GPU0	NIC0	NIC1	CPU Affinity	NUMA Affinity
GPU0	 X 	    PHB	    PHB	    0-31		    N/A
NIC0	PHB     X 	    PIX		
NIC1	PHB	    PIX	    X 		
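
As an additional cross-check from the OS side (besides gdscheck), the ACS control bits can be read via lspci; if ACS were active, flags such as SrcValid+ would show up here:

$ sudo lspci -vvv | grep -i 'ACSCtl'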

nv_peermem is loaded on system #1. Since system #2 doesn’t have a supported GPU, there is no nv_peermem available there. Checked via lsmod | grep peer.

I haven’t heard anything about this so far. Do you have any references/manuals explaining this in detail?

Two general questions:

  1. When running ib_write_bw I see that one core on each system is fully utilized (100%). I’m not sure, but I expected the CPU to be much less utilized, since a big benefit of RDMA is supposed to be that throughput is almost independent of CPU performance.

  2. Since I’m now able to reach the specified bandwidth, I would like to know whether it is possible to send GPU memory from system #1 via the NIC using GDR without having a supported GPU on system #2. We just want to send data directly (bypassing CPU and system memory) from the RTX A5000 via RoCE. The receiver is an FPGA connected via Ethernet which simply forwards the data with very low latency, in an appropriate format, to another system.

3. PHB is not good, since the traffic crosses the host PCIe bridge. You need to relocate the NIC; PXB (or better) would be preferable.

5. There is no manual for this that I know of. Check "mst status -v" to see which NUMA node the HCA's PCIe slot belongs to, then use lscpu to identify which CPU cores belong to that node, and run the test prefixed with "taskset -c <core #>" etc.
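
A minimal sketch of that workflow on our side (device name and core number are examples for our setup):

$ mst status -v                                         # PCIe address and NUMA info of the HCA
$ cat /sys/class/infiniband/mlx5_0/device/numa_node     # NUMA node of the NIC (-1 or 0 on single-socket boards)
$ lscpu | grep 'NUMA node'                              # which cores belong to that node
$ taskset -c 2 ib_write_bw -F -d mlx5_0 -a --report_gbits   # pin the benchmark to one of those cores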

Regarding your general questions:

1. It works as designed: the RDMA verbs polling in the perftest will utilize one core at 100%. That is already a huge benefit; if you look at a CPU-handled TCP/IP stack, it needs at least several cores at 100% to reach line rate.

2. It is possible to do GPU mem ↔ CPU mem RDMA. There are several test methods:

perftest with --use_cuda (the code needs to be rebuilt with CUDA support),

or the OSU bandwidth test:

osu_bw D H

http://mvapich.cse.ohio-state.edu/static/media/mvapich/README-OMB.txt
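
For reference, this is roughly how we would rebuild perftest with CUDA support and run a GPU-memory bandwidth test; the repository URL, CUDA path, GPU index, and the exact --use_cuda syntax (it differs between perftest versions) are assumptions for our install:

$ git clone https://github.com/linux-rdma/perftest && cd perftest
$ ./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make
$ ./ib_write_bw -d mlx5_0 -a --report_gbits --use_cuda=0     # on system #1 (server), using GPU 0 memory
$ ./ib_write_bw -d mlx5_0 -a --report_gbits 20.4.3.219       # on system #2 (client), plain host memory

The osu_bw D H variant additionally needs an MPI launcher (e.g. mpirun -np 2 across the two hosts) and a CUDA-aware MPI build.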

Thank you very much so far, @xiaofengl!

I agree, but our current mainboard does not allow a better physical placement. We will certainly look for another mainboard that is more suitable for RDMA.

I guess this isn’t required for our setup, since our mainboard has only one socket/CPU, and NUMA is, from my understanding, a concept for managing multiple CPUs (memory domains) connected via one system bus. However, it still makes sense to set the CPU affinity for the benchmark application.
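
Just to confirm the single-node assumption on system #1 (standard tool output, nothing specific to our NICs):

$ numactl --hardware
$ lscpu | grep -i numa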

Based on this, I started an investigation to understand, and actually experience, the real benefit of RoCE in our minimal point-to-point setup (no switches). I hope qperf is a reliable way to compare the two communication protocols. qperf was first started on one system as a server (see the note below); then we executed the following commands on the other system, with the corresponding results:
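
For completeness, the server side of qperf needs no arguments; if I remember correctly it listens on TCP port 19765 by default:

$ qperf        # run on 20.4.3.219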

$ qperf -t 20 -cm1 -vv -m 512K --use_bits_per_sec 20.4.3.219 rc_bw
rc_bw:
    bw                =     23.1 Gb/sec
    msg_rate          =     5.52 K/sec
    msg_size          =      512 KiB (524,288)
    time              =       20 sec
    timeout           =        5 sec
    use_cm            =        1 
    send_cost         =     4.28 ms/GB
    recv_cost         =     4.49 ms/GB
    send_cpus_used    =     1.25 % cpus
    send_cpus_user    =      0.2 % cpus
    send_cpus_kernel  =     1.05 % cpus
    send_real_time    =       20 sec
    send_cpu_time     =      250 ms
    send_bytes        =     58.4 GB
    send_msgs         =  111,394 
    send_max_cqe      =        1 
    recv_cpus_used    =      1.3 % cpus
    recv_cpus_user    =      0.4 % cpus
    recv_cpus_intr    =     0.35 % cpus
    recv_cpus_kernel  =     0.55 % cpus
    recv_real_time    =       20 sec
    recv_cpu_time     =      260 ms
    recv_bytes        =     57.9 GB
    recv_msgs         =  110,369 
    recv_max_cqe      =        1 
$ qperf -t 20 -cm1 -vv -m 512K --use_bits_per_sec 20.4.3.219 tcp_bw
tcp_bw:
warning: -cm1 set but not used in test tcp_bw
    bw                =     23.5 Gb/sec
    msg_rate          =     5.61 K/sec
    msg_size          =      512 KiB (524,288)
    time              =       20 sec
    timeout           =        5 sec
    send_cost         =      156 ms/GB
    recv_cost         =      285 ms/GB
    send_cpus_used    =       46 % cpus
    send_cpus_user    =     0.15 % cpus
    send_cpus_intr    =     27.3 % cpus
    send_cpus_kernel  =     18.4 % cpus
    send_cpus_iowait  =     0.05 % cpus
    send_real_time    =       20 sec
    send_cpu_time     =     9.19 sec
    send_bytes        =     58.8 GB
    send_msgs         =  112,230 
    recv_cpus_used    =     83.8 % cpus
    recv_cpus_user    =     0.15 % cpus
    recv_cpus_intr    =     34.4 % cpus
    recv_cpus_kernel  =     49.1 % cpus
    recv_cpus_iowait  =     0.05 % cpus
    recv_real_time    =       20 sec
    recv_cpu_time     =     16.8 sec
    recv_bytes        =     58.8 GB
    recv_msgs         =  112,222

From my understanding, the CPU cost of RDMA with an RC QP is much lower than with TCP. A short comparison:

send cost: TCP is about 36x more expensive than RDMA (156 vs. 4.28 ms/GB)
recv cost: TCP is about 63.5x more expensive than RDMA (285 vs. 4.49 ms/GB)


Let’s check the latency:

$ qperf -t 20 -cm1 -vv --use_bits_per_sec 20.4.3.219 rc_lat
rc_lat:
    latency          =  5.86 us
    msg_rate         =   171 K/sec
    msg_size         =     1 bytes
    time             =    20 sec
    timeout          =     5 sec
    use_cm           =     1 
    loc_cpus_used    =  55.5 % cpus
    loc_cpus_user    =   8.4 % cpus
    loc_cpus_intr    =  22.3 % cpus
    loc_cpus_kernel  =  24.8 % cpus
    loc_real_time    =    20 sec
    loc_cpu_time     =  11.1 sec
    loc_send_bytes   =  1.71 MB
    loc_recv_bytes   =  1.71 MB
    loc_send_msgs    =  1.71 million
    loc_recv_msgs    =  1.71 million
    rem_cpus_used    =  42.8 % cpus
    rem_cpus_user    =  11.9 % cpus
    rem_cpus_intr    =  17.5 % cpus
    rem_cpus_kernel  =  13.3 % cpus
    rem_real_time    =    20 sec
    rem_cpu_time     =  8.56 sec
    rem_send_bytes   =  1.71 MB
    rem_recv_bytes   =  1.71 MB
    rem_send_msgs    =  1.71 million
    rem_recv_msgs    =  1.71 million
$ qperf -t 20 -cm1 -vv --use_bits_per_sec 20.4.3.219 tcp_lat
tcp_lat:
warning: -cm1 set but not used in test tcp_lat
    latency          =  7.78 us
    msg_rate         =   128 K/sec
    msg_size         =     1 bytes
    time             =    20 sec
    timeout          =     5 sec
    loc_cpus_used    =    50 % cpus
    loc_cpus_user    =   1.1 % cpus
    loc_cpus_intr    =  29.5 % cpus
    loc_cpus_kernel  =  19.4 % cpus
    loc_real_time    =    20 sec
    loc_cpu_time     =    10 sec
    loc_send_bytes   =  1.28 MB
    loc_recv_bytes   =  1.28 MB
    loc_send_msgs    =  1.28 million
    loc_recv_msgs    =  1.28 million
    rem_cpus_used    =  50.2 % cpus
    rem_cpus_user    =  0.65 % cpus
    rem_cpus_intr    =  34.6 % cpus
    rem_cpus_kernel  =  14.9 % cpus
    rem_real_time    =    20 sec
    rem_cpu_time     =  10.1 sec
    rem_send_bytes   =  1.28 MB
    rem_recv_bytes   =  1.28 MB
    rem_send_msgs    =  1.28 million
    rem_recv_msgs    =  1.28 million

It seems RDMA's latency is about 25% lower than TCP's (5.86 µs vs. 7.78 µs). I expected an even bigger difference based on all my reading so far, but it’s likely that I have to optimize the setup first. One way to reduce the RoCE latency further is to enable polling mode in qperf (-cp1), which lowers the latency considerably:

$ qperf -t 20 -cm1 -cp1 -vv --use_bits_per_sec 20.4.3.219 rc_lat
rc_lat:
    latency          =  2.26 us
    msg_rate         =   442 K/sec
    msg_size         =     1 bytes
    poll_mode        =     1 
    time             =    20 sec
    timeout          =     5 sec
    use_cm           =     1 
    loc_cpus_used    =   100 % cpus
    loc_cpus_user    =   100 % cpus
    loc_cpus_intr    =  0.05 % cpus
    loc_cpus_kernel  =   0.1 % cpus
    loc_real_time    =    20 sec
    loc_cpu_time     =    20 sec
    loc_send_bytes   =  4.42 MB
    loc_recv_bytes   =  4.42 MB
    loc_send_msgs    =  4.42 million
    loc_recv_msgs    =  4.42 million
    rem_cpus_used    =   100 % cpus
    rem_cpus_user    =   100 % cpus
    rem_cpus_intr    =  0.05 % cpus
    rem_cpus_kernel  =   0.1 % cpus
    rem_real_time    =    20 sec
    rem_cpu_time     =    20 sec
    rem_send_bytes   =  4.42 MB
    rem_recv_bytes   =  4.42 MB
    rem_send_msgs    =  4.42 million
    rem_recv_msgs    =  4.42 million

Polling widens the gap between TCP and RDMA even further: RDMA's latency is now about 71% lower than TCP's (2.26 µs vs. 7.78 µs).

@xiaofengl Can you please confirm my findings? Am I investigating the RDMA performance in the right way? I’m fairly sure there is still room for improvement, but we first need to understand whether our setup behaves as expected, as a basis for further investigation and tests.

Yes, I think so. Your perf results look fine for your setup; I see no issue.

When comparing RDMA with TCP, latency alone does not tell the whole story. If you look into the TCP stack, you can see that it cannot guarantee latency; it is based on sliding windows and dynamic response times, whereas RDMA (RoCE/InfiniBand) relies on hardware-controlled ACKs. On top of that, TCP/IP was not designed for high-performance networking, and drops/retransmits are another problem. You may get good results on an idle system, but under high network/server load it is a different story.

