Mellanox ConnectX-6 is ignoring RDMA_WRITEs sent by proprietary hardware

Hi,

We have a Mellanox ConnectX-6 NIC in one of our servers, which runs Ubuntu 22.04.4 LTS with kernel 5.15.0-105-generic. We run perftest on this machine as the server. The other end is our own hardware running our own software, which acts as the client and performs the QP negotiation over a TCP connection. Based on the information received from the server, we craft RDMA_WRITE messages and send them to the NIC.

However, the NIC doesn’t seem to recognize our RDMA_WRITE packets. None of the counters in /sys/class/infiniband/mlx5_4/ports/1/hw_counters, not even the error counters, increment when we send RDMA_WRITE packets to it. At the same time, port_rcv_packets and unicast_rcv_packets under /sys/class/infiniband/mlx5_4/ports/1/counters do increment when the RDMA_WRITEs are sent. We have captured all the packet exchanges (TCP and RDMA) on the NIC using tcpdump, and we see that our RDMA packets are received intact with the FCS bytes removed, which rules out FCS errors. Using ethtool -S, we found that the Rx PHY counters also increase as expected.

Despite this, the NIC does not send any ACK in response, and the perftest BW test does not report having received any packets (the peak, avg and msg avg rates are all 0.0). I have attached the capture taken with tcpdump. Could someone please let me know why the NIC is ignoring our packets?
rdma_one_arm.pcap.gz (5.2 KB)
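
For anyone wanting to repeat the counter check described above, here is a minimal sketch of how the two sysfs counter directories can be snapshotted before and after a burst of RDMA_WRITEs and then compared. The helper names read_counters and diff_counters are made up for this illustration, and the device path in the trailing comment is simply the one from this post.

```python
from pathlib import Path


def read_counters(directory):
    """Return {counter_name: value} for every numeric file in `directory`."""
    counters = {}
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip entries that cannot be read or are not plain integers.
            continue
    return counters


def diff_counters(before, after):
    """Return {counter_name: delta} for counters whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


# Example use on the server (path taken from this post):
#   base = "/sys/class/infiniband/mlx5_4/ports/1"
#   hw_before = read_counters(base + "/hw_counters")
#   port_before = read_counters(base + "/counters")
#   ... let the client send its RDMA_WRITEs ...
#   print(diff_counters(hw_before, read_counters(base + "/hw_counters")))
#   print(diff_counters(port_before, read_counters(base + "/counters")))
```

An empty result for hw_counters together with a non-empty result for counters would match the behaviour reported here.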

Here is the ibv_devinfo output for the NIC I am using:
hca_id: mlx5_4
transport: InfiniBand (0)
fw_ver: 20.35.3006
node_guid: b8ce:f603:00f8:bab2
sys_image_guid: b8ce:f603:00f8:bab2
vendor_id: 0x02c9
vendor_part_id: 4123
hw_ver: 0x0
board_id: MT_0000000225
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

lspci|grep mel -i
17:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
17:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
31:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
31:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
b1:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
b1:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
ca:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
ca:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]

perftest is run with ib_write_bw -d mlx5_4. The default is 5000 iterations, but in this particular run, to debug the issue, our client sent RDMA packets for only 5 iterations. Sending all 5000 iterations does not resolve the issue either.

rdma link show mlx5_4/1
link mlx5_4/1 state ACTIVE physical_state LINK_UP netdev ens106f0np0
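
Since the RDMA_WRITE packets are hand-crafted from the values exchanged over TCP, one thing that can be checked against the capture is whether the transport header fields match what the server advertised. Below is a rough sketch that decodes the 12-byte Base Transport Header (BTH) and the 16-byte RDMA Extended Transport Header (RETH) from a UDP payload. It assumes RoCEv2 encapsulation (UDP destination port 4791), which this post does not state; the Bth and Reth classes and both function names are hypothetical, and extracting the UDP payload from the pcap is left out.

```python
import struct
from typing import NamedTuple

ROCEV2_UDP_PORT = 4791  # assumption: packets use RoCEv2 encapsulation

# Reliable Connection (RC) opcodes relevant to RDMA WRITE traffic.
RC_OPCODES = {
    0x06: "RDMA_WRITE_FIRST",
    0x07: "RDMA_WRITE_MIDDLE",
    0x08: "RDMA_WRITE_LAST",
    0x0A: "RDMA_WRITE_ONLY",
    0x11: "ACKNOWLEDGE",
}


class Bth(NamedTuple):
    opcode: int
    solicited: bool
    migreq: bool
    pad_count: int
    tver: int
    pkey: int
    dest_qp: int
    ack_req: bool
    psn: int


class Reth(NamedTuple):
    virtual_address: int
    rkey: int
    dma_length: int


def parse_bth(udp_payload: bytes) -> Bth:
    """Decode the BTH from the first 12 bytes of the UDP payload."""
    if len(udp_payload) < 12:
        raise ValueError("payload too short for a BTH")
    opcode, flags, pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:12]
    )
    return Bth(
        opcode=opcode,
        solicited=bool(flags & 0x80),
        migreq=bool(flags & 0x40),
        pad_count=(flags >> 4) & 0x3,
        tver=flags & 0x0F,
        pkey=pkey,
        dest_qp=qp_word & 0xFFFFFF,       # top byte is reserved
        ack_req=bool(psn_word & 0x80000000),
        psn=psn_word & 0xFFFFFF,
    )


def parse_reth(udp_payload: bytes) -> Reth:
    """Decode the RETH that follows the BTH in WRITE First/Only packets."""
    if len(udp_payload) < 28:
        raise ValueError("payload too short for BTH + RETH")
    va, rkey, length = struct.unpack("!QII", udp_payload[12:28])
    return Reth(virtual_address=va, rkey=rkey, dma_length=length)
```

The decoded dest_qp, psn, rkey and virtual_address can then be compared by hand with the values the server sent during the TCP exchange. Note that this only inspects headers and does not validate the ICRC at the end of the packet.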

Thanks,
Dibyendu

Hello @dibyendu.chakraborty,

Thank you for reaching out to our community with your query. To narrow down the issue and determine whether it’s related to our NIC or your software, have you attempted running perftest between two hosts equipped with ConnectX-6 NICs and our drivers? If so, did you observe the same issue?

The tcpdump capture alone may not be sufficient to ascertain why the RDMA counters are not incrementing. (FYR - ESPCommunity)

Based on the results of the aforementioned test, please consider opening a support case with us as necessary and providing a sysinfo snapshot from the host for our review. You can initiate a support case by emailing “Networking-support@nvidia.com”. Kindly note that an active support contract would be required for this service. For information regarding contracts, please don’t hesitate to contact our contracts team at “Networking-Contracts@nvidia.com”.

Thank you,
Bhargavi