Dear Nvidia support,
we are experiencing trouble with our RoCEv2 MLNX_OFED-based application running on a Linux PC with Ubuntu 22 and a Mellanox ConnectX-5 card with one port at 100 Gbit/s.
The application periodically receives, from a proprietary FPGA over a 100 Gb interface, a sequence of 304128 frames, each containing a UD Send message with a 920-byte payload.
In short, we can only get the data dispatched to an IBV_QPT_RAW_PACKET-based Queue Pair; we cannot set up a steering rule capable of dispatching these messages to an IBV_QPT_UD-based Queue Pair. When we try, we simply receive nothing from the completion queue (ibv_poll_cq always returns 0) and have no clue why this traffic is silently dropped by the Mellanox card.
It is essential for us to understand whether there is something to fix on the sender side, or whether we made some mistake in programming the IB verbs on the receiver side; unfortunately, no counter has given us any clue so far, which is why we are asking for your support.
We enclose the relevant source code for both cases: the non-receiving code is named rdma.c, while the “working” (but unacceptable for our final application's scope of work) Raw Packet QP-based version is named sniffer.c.
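For reference, a minimal UD receive setup looks roughly like the sketch below (illustrative only, not the enclosed rdma.c; the Q_Key, device index and buffer counts are placeholder assumptions, and error handling is omitted):

/* Minimal UD receiver sketch (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <infiniband/verbs.h>

#define QKEY     0x11111111u   /* must match the Q_Key the FPGA puts in the DETH */
#define RECV_LEN (40 + 920)    /* 40 reserved header bytes + 920-byte payload    */

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx = ibv_open_device(devs[0]);           /* first device */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 4096, NULL, NULL, 0);

    char *buf = calloc(4096, RECV_LEN);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096UL * RECV_LEN, IBV_ACCESS_LOCAL_WRITE);

    struct ibv_qp_init_attr qpia = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 1, .max_recv_wr = 4096,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_UD,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpia);
    printf("receiver qp_num = %u (the FPGA must target this DestQP)\n", qp->qp_num);

    /* RESET -> INIT -> RTR -> RTS; for a UD QP only state, pkey, port and qkey matter */
    struct ibv_qp_attr a = { .qp_state = IBV_QPS_INIT, .pkey_index = 0,
                             .port_num = 1, .qkey = QKEY };
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_QKEY);
    a = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTR };
    ibv_modify_qp(qp, &a, IBV_QP_STATE);
    a = (struct ibv_qp_attr){ .qp_state = IBV_QPS_RTS, .sq_psn = 0 };
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_SQ_PSN);

    /* Post the receive ring: each WQE gets room for the 40 reserved bytes + the payload */
    for (int i = 0; i < 4096; i++) {
        struct ibv_sge sge = { .addr = (uintptr_t)(buf + (size_t)i * RECV_LEN),
                               .length = RECV_LEN, .lkey = mr->lkey };
        struct ibv_recv_wr wr = { .wr_id = i, .sg_list = &sge, .num_sge = 1 };
        struct ibv_recv_wr *bad;
        ibv_post_recv(qp, &wr, &bad);
    }

    /* Poll for completions: this is the loop in which we only ever see 0 */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    printf("got %u bytes, status %d\n", wc.byte_len, wc.status);
    return 0;
}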
When we inspect the incoming frames using Wireshark, what we see appears to be a sequence of regular UD Sends, exactly as we expect. The only field we have no direct means of checking for correctness is the ICRC. Short of better ideas, we tried a sender-side computation as a double check, which gives mismatching values, but we have no particular reason to trust our sender-side algorithm more than the calculation in the FPGA, which we know is based on open-source code. Your review of the ICRC calculation in the “working” sniffer code (eval_icrc) would also be appreciated.
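To make the review easier, our current understanding of the computation for the IPv4 case is roughly the sketch below (simplified and illustrative, not the enclosed eval_icrc; offsets assume an untagged Ethernet frame, zlib's crc32 is used as the CRC-32 routine, and the byte order in which the final 32-bit value should appear on the wire is one of the points we are unsure of):

#include <stdint.h>
#include <string.h>
#include <zlib.h>     /* crc32(): standard CRC-32, same polynomial as the Ethernet FCS */

/* pkt/len: an untagged Ethernet frame, with len ending right before the 4-byte ICRC.
 * Layout assumed: 14 B Ethernet + 20 B IPv4 + 8 B UDP + 12 B BTH + payload. */
static uint32_t eval_icrc_sketch(const uint8_t *pkt, size_t len)
{
    uint8_t buf[8 + 2048];
    size_t n = len - 14;                     /* ICRC scope starts at the IP header */

    memset(buf, 0xff, 8);                    /* 64-bit all-ones prefix ("dummy LRH") */
    memcpy(buf + 8, pkt + 14, n);

    /* Fields masked to all ones for the computation (IPv4 case): */
    buf[8 + 1]  = 0xff;                      /* IPv4 DSCP/ECN (ToS byte)  */
    buf[8 + 8]  = 0xff;                      /* IPv4 TTL                  */
    buf[8 + 10] = 0xff; buf[8 + 11] = 0xff;  /* IPv4 header checksum      */
    buf[8 + 26] = 0xff; buf[8 + 27] = 0xff;  /* UDP checksum              */
    buf[8 + 32] = 0xff;                      /* BTH reserved byte         */

    return (uint32_t)crc32(0L, buf, (uInt)(8 + n));
}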
An excerpt of the Wireshark display is attached for your convenience (please note that the destination port you see, 4784, is actually replaced with the correct 4791 in the running tests, as is the Destination Queue Pair, which is 2 in the capture but is replaced with the output qp_num in the real application).
Looking forward to your reply and suggestions.
Thank you in advance
Dear Roberto,
I ran a working UD application on an OFED system and captured a packet.
I see only one difference in my capture: MigReq=1 in the BTH.
I assume dest QP, Queue key and Invariant CRC are correct in your packet.
Try to set this flag and see if it helps.
Best regards,
Michael
First of all, thanks for your valuable support. We examined your screenshot and, besides byte 43, where we have the MigReq difference you pointed out, we found some other differences in the frame that I would like to discuss.
At byte 15 we see that you have 0x02 (ECN Capable Transport(1)), while we set 0x0 (not ECT).
Does this have any relevance in your opinion? We will try setting 0x02 as you do.
At bytes 18-19 we see in your frame a value for Identification (2059 in your sample).
We set this to zero and did not expect it to be meaningful.
Annex17_RoCEv2.pdf, page 5, does not make any reference to this field.
Could this be an issue in your opinion?
Are there any rules to follow and, if this is the case, do you have any normative reference besides Annex17_RoCEv2.pdf?
At bytes 20-21 we see in your frame the value 0x4000, which corresponds to flags ‘010’ (Don’t Fragment); our frame has ‘000’ instead. We will try setting ‘010’ as you do.
Byte 22, TTL: although I don’t expect this to be an actual issue, we will use your value 255 instead of our 128.
Payload length: we see it is 1024 in your example.
Our payload is actually 920 bytes long. We suspect there may be some kind of restriction requiring the size to be a power of 2. Can you please clarify this? This point has a significant impact on our architecture.
Lastly, we have an issue inspecting the pcap of RoCE packets when we use tools like udaddy to generate traffic.
We followed the instructions we found in How-To-Enable-Verify-and-Troubleshoot-RDMA for using ibdump, but it returns an error (unfortunately I cannot attach the details now).
Do you have any suggestion about this?
ECN is not related to getting or not getting a completion for the receive WQE.
It is related to congestion control only.
This is the IP Identification field. Its value doesn’t matter; I see it incrementing for each packet.
I suggest trying the Don’t Fragment mode.
TTL is not an issue. The value 1 may not be enough, but any value > 1 is good.
No, there is no such restriction. I just tried the size 1024 with the following commands:
[michaelbe@l-csi-1331h ~]$ ib_send_bw -d mlx5_0 -x 3 --connection=UD -s 1024
[michaelbe@l-csi-1330h ~]$ ib_send_bw -d mlx5_0 -x 3 --connection=UD 11.7.157.146 -s 1024
I’m using tcpdump with the IB device name, for example:
[michaelbe@l-csi-1330h ~]$ ibdev2netdev
mlx5_0 port 1 ==> ens2f0 (Up)
[michaelbe@l-csi-1330h ~]$ sudo tcpdump -i ens2f0 -w /tmp/tcpdump.pcap
This will not capture RoCE traffic because ens2f0 is a network device.
[michaelbe@l-csi-1330h ~]$ sudo tcpdump -i mlx5_0 -w /tmp/tcpdump.pcap
This will capture RoCE traffic because mlx5_0 is an IB device.
I suggest using ethtool on the receive side to check for discards.
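For example (ens2f0 here is just the netdev name reported by ibdev2netdev; the exact counter names depend on the driver and firmware):
[michaelbe@l-csi-1330h ~]$ ethtool -S ens2f0 | grep -i discard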
Hi Michael,
we eventually succeeded in establishing a working sender-receiver flow from our FPGA to the UD receiver based on Mellanox RDMA traffic, so first of all thanks for your much appreciated support!
However, we hope you can help us with a couple of topics:
We noticed that each UD Receive buffer does not contain just the plain payload: 40 extra bytes are present.
Looking at Annex17_RoCEv2.pdf page 12 we see the following:
A17.4.5.2 SCATTERING OF THE L3 HEADER IN UD
The first 40 bytes of user posted UD Receive Buffers are reserved for the L3 header of the incoming packet (as per the InfiniBand Spec Section 11.4.1.2). In RoCEv2, this area is filled up with the IP header. IPv6 header uses the entire 40 bytes. IPv4 headers use the 20 bytes in the second half of the reserved 40 bytes area (i.e. offset 20 from the beginning of the receive buffer). In this case, the content of the first 20 bytes is undefined.
This is not exactly the behaviour we experience: actually, 20 zeroed bytes are added before the data, the data starts at offset 20, and 20 extra bytes are appended at the end (I couldn’t check their actual content, but we don’t really care).
Having this overhead data interspersed in our buffers is quite a nuisance for us, as it forces us to perform many CPU-based memory copy operations just to get the plain data.
Do you know of any way to avoid it, or at least reduce it (e.g. by array-like concatenation of multiple buffers)?
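One direction we are evaluating, if it can work here, is posting each receive with a two-entry scatter/gather list, so that the 40 reserved bytes land in a separate scratch buffer and the payload stays contiguous in its own buffer. A sketch of what we mean (identifiers are illustrative, not from the enclosed sources; the QP would need max_recv_sge >= 2):

#include <stdint.h>
#include <infiniband/verbs.h>

/* grh_scratch and data_buf must lie inside memory registered as grh_mr and data_mr. */
static int post_recv_skip_grh(struct ibv_qp *qp,
                              void *grh_scratch, struct ibv_mr *grh_mr,
                              void *data_buf, struct ibv_mr *data_mr,
                              uint32_t payload_len, uint64_t wr_id)
{
    struct ibv_sge sge[2] = {
        /* first 40 bytes: the reserved L3-header area goes to a throwaway buffer */
        { .addr = (uintptr_t)grh_scratch, .length = 40, .lkey = grh_mr->lkey },
        /* remaining bytes: the actual payload, contiguous in data_buf */
        { .addr = (uintptr_t)data_buf, .length = payload_len, .lkey = data_mr->lkey },
    };
    struct ibv_recv_wr wr = { .wr_id = wr_id, .sg_list = sge, .num_sge = 2 };
    struct ibv_recv_wr *bad_wr = NULL;

    return ibv_post_recv(qp, &wr, &bad_wr);
}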
As you suggested, we tried to use the tcpdump command to sniff the RDMA data (with the mlx5_0 device), but we always get errors and we don’t understand what is wrong in our configuration/setup.
Below are the info on the Mellanox card, the tcpdump call we use, and the corresponding error message:
gw@PSS-1-Lab:~$ ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.35.2000
Hardware version: 0
Node GUID: 0x1070fd03001cc334
System image GUID: 0x1070fd03001cc334
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x1270fdfffe1cc334
Link layer: Ethernet
gw@PSS-1-Lab:~$ sudo tcpdump -i mlx5_0 -w ./test_rdma.pcap
tcpdump: mlx5_0: No such device exists
(SIOCGIFHWADDR: No such device)
In case it helps, below is some info about the tcpdump version and the Linux packet-capture library on our Linux machine (Ubuntu Desktop 22):
gw@PSS-1-Lab:~$ tcpdump -help
tcpdump version 4.99.1
libpcap version 1.10.1 (with TPACKET_V3)
OpenSSL 3.0.2 15 Mar 2022
Hi Roberto,
Regarding the first question, I don’t think we can change this behavior.
Did you try using RC instead of UD? Could this work for you?
Regarding the second question,
Probably you don’t have an mlx5_0 device.
Can you call ibdev2netdev? Or maybe mst status -v? (to see the IB devices loaded on your system)
BR,
Michael.
Try to use tcpdump version 4.9.2. If that doesn’t work, further investigation should be done via a support case. If you have the proper entitlement for the cards, you can open a case by sending an email to enterprisesupport@nvidia.com.
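As a quick preliminary check, it may also be worth seeing whether libpcap on that machine lists the RDMA device at all (it only does if it was built with RDMA sniffing support); if mlx5_0 does not appear in the output of
$ tcpdump -D
then tcpdump cannot capture on it regardless of the command line used.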
Hi Michael,
about the 40 extra bytes issue:
Instead of dealing with the complexity of the Reliable Connection message exchange (ACK, NAK, etc.), we would rather consider a simpler transition to Send over Unreliable Connection (UC).
According to the previously quoted Annex17_RoCEv2.pdf, page 12, the 40-byte buffer overhead is expected for UD only, not for UC.
Does this make sense in your opinion?
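Concretely, the change we have in mind is limited to the QP creation and the INIT-to-RTR transition, roughly as in the sketch below (illustrative values only, not from the enclosed sources; the remote QPN, GID, GID index and MTU are placeholders, and the FPGA would have to emit UC Send opcodes with consecutive PSNs):

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Sketch of the UC receive QP we are considering. Unlike UD, a UC QP is bound
 * to a single remote sender: the FPGA's QPN, starting PSN and GID must be set here. */
static struct ibv_qp *create_uc_rx_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                      uint32_t remote_qpn, union ibv_gid remote_gid)
{
    struct ibv_qp_init_attr init = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 1, .max_recv_wr = 4096,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_UC,                    /* was IBV_QPT_UD */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &init);

    struct ibv_qp_attr a = { .qp_state = IBV_QPS_INIT, .pkey_index = 0,
                             .port_num = 1, .qp_access_flags = 0 };
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                          IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);

    memset(&a, 0, sizeof(a));
    a.qp_state    = IBV_QPS_RTR;
    a.path_mtu    = IBV_MTU_1024;                 /* placeholder */
    a.dest_qp_num = remote_qpn;                   /* the FPGA's QPN */
    a.rq_psn      = 0;                            /* must match the FPGA's first PSN */
    a.ah_attr.is_global = 1;
    a.ah_attr.port_num  = 1;
    a.ah_attr.grh.dgid       = remote_gid;        /* the FPGA's GID */
    a.ah_attr.grh.sgid_index = 3;                 /* placeholder RoCEv2 GID index */
    a.ah_attr.grh.hop_limit  = 255;
    ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                          IBV_QP_DEST_QPN | IBV_QP_RQ_PSN);
    return qp;                                    /* RTR is enough for a receive-only QP */
}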