RoCE testing between virtual and physical machines

Greetings to all,

I’ve created two virtual machines on VMware ESXi 8.0 with PVRDMA for RoCEv2 and successfully tested RoCE functionality between them using ib_send_bw and ib_write_bw, reaching bandwidths of up to 100 Gbit/s. Now I’d like to test RoCE between a physical Nvidia DGX1 with four ConnectX-4 cards and a virtual machine. I plan to use the four single-port Mellanox HCAs as separate interfaces, each carrying the same untagged VLAN 1 for RoCE; bonding will be configured only for TCP traffic on VLAN 500. However, I’m unable to perform RoCE testing: I can ping from the VM to the DGX, but tools like ib_write_bw fail and rping does not work. Is it possible to run RoCE tests between a virtual machine with a PVRDMA-based rocep2s3f1 adapter and a physical Nvidia DGX1 with mlx5_0-3 adapters? And how do I correctly set up PFC on the virtual machine and the DGX1?
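For reference, this is roughly how I run the bandwidth test today; the device names are from my setup, while the GID indexes and the server IP are examples that need to be checked with show_gids (shipped with MLNX_OFED) and adjusted to the RoCE VLAN addressing:

# On the DGX1 (server side) - GID index 3 is an example, pick the RoCEv2 entry from show_gids
ib_write_bw -d mlx5_0 -x 3 --report_gbits

# On the virtual machine (client side), pointing at the DGX1's address on the RoCE VLAN (example IP)
ib_write_bw -d rocep2s3f1 -x 1 --report_gbits 192.168.100.10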

Output of the ibdev2netdev command on the DGX1:

mlx5_0 port 1 ==> bond1 (Up)
mlx5_1 port 1 ==> bond0 (Up)
mlx5_2 port 1 ==> bond1 (Up)
mlx5_3 port 1 ==> bond0 (Up)

My setup consists of two Dell PowerEdge R6525 ESXi hosts, each with two Mellanox ConnectX-6 single-port NICs, connected to a pair of MLAG peer-linked Cumulus Linux switches (Mellanox Spectrum SN3700). RoCE is enabled in lossless mode, and the PFC priority is set to 3.
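On the switch side, the lossless RoCE settings were applied roughly as follows (a sketch assuming Cumulus Linux 5.x with NVUE, whose default lossless RoCE profile enables PFC on priority 3 plus ECN; older releases use different commands, so please verify against the Cumulus documentation):

# Enable the lossless RoCE profile (PFC priority 3 + ECN) and apply the configuration
nv set qos roce mode lossless
nv config apply

# Verify the resulting PFC/ECN state
nv show qos roce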

I appreciate any insights or suggestions that you can provide, and I look forward to hearing from the experts here. Thank you in advance for your help!

Best regards,

Shakhizat

Are you using the same version of the perftest tools on both sides?
Can you show the error you get when you run the test?
Thanks,
Suo

Hi @zhangsuo, thanks for your reply. @dwaxman already suggested updating perftest; yes, the versions are different. Currently I am using rping and udaddy to check RoCE. I suspect I haven’t configured PFC correctly on the DGX1.
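For completeness, these are the rping invocations I use (server on the DGX1, client in the VM; the address is just an example on the RoCE VLAN):

# On the DGX1 (server), listening on its RoCE VLAN address
rping -s -a 192.168.100.10 -v -C 10

# On the virtual machine (client), connecting to the same address
rping -c -a 192.168.100.10 -v -C 10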

Output of rping:

cma event RDMA_CM_EVENT_UNREACHABLE, error -110
wait for connected state 4
connect error -1

Best regards,
Shakhizat

@zhangsuo, @dwaxman, I also suspect that instead of PVRDMA I should use SR-IOV for the virtual machines on VMware ESXi. I just discovered that Lustre does not support PVRDMA, and since we have DDN storage this may be an issue. We do have active technical support from Nvidia for our DGX A100, but not for the DGX1; I use the DGX1 as a test bed system.
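Once the SR-IOV VF is passed through, I plan to verify that the guest sees a native mlx5 RDMA device with the rdma-core tools below (the device name inside the VM will differ from the PVRDMA one):

# List RDMA devices visible in the guest and show their port state, link layer and GIDs
ibv_devices
ibv_devinfo -v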

Hi @zhangsuo, just FYI, I was able to test RoCE via an SR-IOV adapter on the virtual machines. Could you please give your recommendations or share your best practices on how to correctly configure PFC on both the switch and DGX ends? I would really appreciate it.

Hi Shakhizat,
You can refer to:
https://mellanox.my.site.com/mellanoxcommunity/s/article/howto-configure-roce-over-a-lossy-fabric--ecn--end-to-end-using-connectx-4-and-spectrum--trust-l3-x
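If you also want PFC on the DGX side, a starting point can look like the sketch below (the interface name is an example; run it for each RoCE uplink, and the priority and DSCP trust settings must match your switch configuration):

# Enable PFC only on priority 3 and trust DSCP markings on the RoCE interface
mlnx_qos -i enp5s0 --pfc 0,0,0,1,0,0,0,0 --trust dscp

# Show the applied QoS settings
mlnx_qos -i enp5s0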

Thanks,
Suo


Hi @zhangsuo, thanks for your informative reply. I have a question. I checked the packets in Wireshark while running the ib_send_bw tool from the virtual machine to the DGX, and made a strange observation: I only saw TCP packets, no UDP or RoCE. Does this mean my 100 Gbit/s interface is working over TCP rather than RoCE?

You need to compile libpcap with RDMA support (version 1.9 or higher is required), and you must capture on the RDMA device (e.g. mlx5_0).
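A rough build sequence looks like this (a sketch assuming an Ubuntu-based DGX OS and the upstream git repositories; libpcap's configure should pick up libibverbs automatically, so check its output for RDMA sniffing support, and note that exact steps can differ between versions):

# Build libpcap (>= 1.9) and tcpdump from source with the libibverbs headers installed
sudo apt-get install -y build-essential flex bison libibverbs-dev
git clone https://github.com/the-tcpdump-group/libpcap.git
git clone https://github.com/the-tcpdump-group/tcpdump.git
(cd libpcap && ./autogen.sh && ./configure && make && sudo make install)
(cd tcpdump && ./autogen.sh && ./configure && make && sudo make install)
sudo ldconfig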

Thanks,
Suo


Hi @zhangsuo, you absolutely nailed it. After building libpcap from source and rebuilding tcpdump, I was finally able to run the following command successfully:

 sudo tcpdump -i mlx5_0 -s 0 -w rdma.pcap

Previously, with different versions of tcpdump and libpcap, I consistently encountered this error:

tcpdump: mlx5_0: No such device exists
(SIOCGIFHWADDR: No such device)

After applying your solution, I captured traffic for 5-10 seconds, which produced a file of approximately 800 MB, and I was then able to see the RoCE packets in Wireshark.
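In case it helps anyone else: RoCEv2 runs over UDP with destination port 4791, so a Wireshark display filter like the one below should isolate the RDMA traffic in the capture (Wireshark should also dissect the InfiniBand BTH headers inside those UDP packets):

udp.dstport == 4791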

Thanks a lot!

Best regards,
Shakhizat

