Hi,
We have a Mellanox ConnectX-6 NIC in one of our servers, which runs Ubuntu 22.04.4 LTS with kernel 5.15.0-105-generic. We run perftest on this machine as the server. The other end is our own hardware running our own software, which acts as the client and handles QP negotiation over a TCP connection. Based on the information received from the server, we craft RDMA_WRITE packets and send them to the NIC.

However, the NIC does not seem to recognize our RDMA_WRITE packets. None of the counters in /sys/class/infiniband/mlx5_4/ports/1/hw_counters (not even the error counters) increment when we send RDMA_WRITE packets to it. At the same time, port_rcv_packets and unicast_rcv_packets under /sys/class/infiniband/mlx5_4/ports/1/counters do increment when the RDMA_WRITEs are sent.

We captured all the packet exchanges (TCP and RDMA) on the NIC using tcpdump, and we see that our RDMA packets are received fine with the FCS bytes removed, so that rules out FCS errors. Using ethtool -S, we found that the Rx PHY counters also increase as expected. But the NIC does not send any ACK in response, nor does the perftest BW test report having received any packets (the peak, avg and msg avg rates are all 0.0).

I have attached the capture that was taken using tcpdump. Could someone please let me know why the NIC is ignoring our packets?
rdma_one_arm.pcap.gz (5.2 KB)
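For anyone who wants to repeat the counter comparison described above, here is a minimal sketch of how it could be scripted. It snapshots every numeric file in the two sysfs directories, waits, snapshots again, and reports only the counters that changed. The directory paths are the ones from this post; the function names (`read_counters`, `diff_counters`, `snapshot_and_diff`) and the 10-second wait are assumptions made for illustration.

```python
import os
import time

def read_counters(directory):
    """Return {counter_name: int_value} for each numeric file in directory.

    A missing directory yields an empty dict; non-numeric or unreadable
    entries are skipped.
    """
    counters = {}
    if not os.path.isdir(directory):
        return counters
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return counters

def diff_counters(before, after):
    """Return {name: delta} for counters present in both snapshots that changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }

def snapshot_and_diff(directories, wait_seconds=10):
    """Snapshot each directory, wait, snapshot again, and return the deltas."""
    before = {d: read_counters(d) for d in directories}
    time.sleep(wait_seconds)  # send the RDMA_WRITE packets during this window
    return {d: diff_counters(before[d], read_counters(d)) for d in directories}

# Example call (paths taken from this post; not run automatically):
# base = "/sys/class/infiniband/mlx5_4/ports/1"
# print(snapshot_and_diff([base + "/counters", base + "/hw_counters"]))
```

With the behaviour described in this post, the expected result would be non-empty deltas for the counters directory and an empty dict for hw_counters.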
Here is the ibv_devinfo output for the NIC I am using:
hca_id: mlx5_4
    transport:          InfiniBand (0)
    fw_ver:             20.35.3006
    node_guid:          b8ce:f603:00f8:bab2
    sys_image_guid:     b8ce:f603:00f8:bab2
    vendor_id:          0x02c9
    vendor_part_id:     4123
    hw_ver:             0x0
    board_id:           MT_0000000225
    phys_port_cnt:      1
        port:   1
            state:          PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:     1024 (3)
            sm_lid:         0
            port_lid:       0
            port_lmc:       0x00
            link_layer:     Ethernet
Output of lspci | grep -i mel:
17:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
17:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
31:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
31:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
b1:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
b1:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
ca:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
ca:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
perftest is run with ib_write_bw -d mlx5_4. Although the default is 5000 iterations, in this particular run our client sent RDMA packets for only 5 iterations in order to debug the issue. Sending 5000 iterations does not resolve the issue either.
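To compare the crafted packets against the values the server handed out during the TCP exchange, the transport headers in the capture can be decoded. Below is a minimal sketch, assuming the packets are RoCEv2 (UDP destination port 4791) and that the UDP payload has already been extracted from the capture. It decodes the 12-byte Base Transport Header and, for RC RDMA WRITE First/Only opcodes, the 16-byte RETH that follows. The function name `parse_bth` and the sample values in the usage line are hypothetical. It does not verify the ICRC at the end of the packet.

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

# Reliable Connection opcodes for RDMA WRITE
RC_WRITE_OPCODES = {
    0x06: "RDMA WRITE First",
    0x07: "RDMA WRITE Middle",
    0x08: "RDMA WRITE Last",
    0x09: "RDMA WRITE Last with Immediate",
    0x0A: "RDMA WRITE Only",
    0x0B: "RDMA WRITE Only with Immediate",
}
# Opcodes whose BTH is immediately followed by a RETH
OPCODES_WITH_RETH = {0x06, 0x0A, 0x0B}

def parse_bth(payload):
    """Decode the BTH (and RETH, if present) from a RoCEv2 UDP payload."""
    if len(payload) < 12:
        raise ValueError("payload shorter than a 12-byte BTH")
    opcode, flags, pkey, qp_word, psn_word = struct.unpack("!BBHII", payload[:12])
    fields = {
        "opcode": opcode,
        "opcode_name": RC_WRITE_OPCODES.get(opcode, "other"),
        "solicited_event": bool(flags & 0x80),
        "mig_req": bool(flags & 0x40),
        "pad_count": (flags >> 4) & 0x3,
        "tver": flags & 0x0F,
        "pkey": pkey,
        "dest_qp": qp_word & 0xFFFFFF,      # low 24 bits
        "ack_req": bool(psn_word & 0x80000000),
        "psn": psn_word & 0xFFFFFF,         # low 24 bits
    }
    if opcode in OPCODES_WITH_RETH:
        if len(payload) < 28:
            raise ValueError("payload too short for BTH + RETH")
        vaddr, rkey, dma_len = struct.unpack("!QII", payload[12:28])
        fields.update({"vaddr": vaddr, "rkey": rkey, "dma_len": dma_len})
    return fields

# Usage with a hand-built header (hypothetical values, not from the capture):
sample = struct.pack("!BBHII", 0x0A, 0x00, 0xFFFF, 0x000123, 0x80000456)
sample += struct.pack("!QII", 0x7F0000001000, 0xABCD, 65536)
print(parse_bth(sample))
```

The decoded dest_qp, psn, rkey and vaddr can then be checked against what the server advertised for this run.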
Output of rdma link show mlx5_4/1:
link mlx5_4/1 state ACTIVE physical_state LINK_UP netdev ens106f0np0
Thanks,
Dibyendu