GPUDirect RDMA on PowerEdge R750

I’m developing an RDMA-based application that needs to write to and read from GPU memory.
On a Dell PowerEdge R750 with a T4, Xeon 6346 CPUs (Ice Lake) and a Mellanox ConnectX-6, neither writes nor reads work: writes fail silently (no error is raised, but the memory always reads back as 0), while reads are refused by the RNIC with a work completion error, 11: remote operation error.

On an almost identical system, a PowerEdge R740xd with a ConnectX-5 and Cascade Lake CPUs, the application works correctly.

The same behavior is reproduced by perftest, which fails to read when CUDA memory is used on the Ice Lake platform.
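For reference, this is roughly how I run perftest (the device name and GPU index are placeholders for my setup, --use_cuda needs a perftest build with CUDA support, and the exact flag syntax may differ between perftest versions):

    # server side (R750: T4 + ConnectX-6), data placed in CUDA memory
    ib_read_bw -d mlx5_0 --use_cuda=0

    # client side, pointing at the server
    ib_read_bw -d mlx5_0 --use_cuda=0 <server-address>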

Are there any known incompatibilities with this hardware setup?

You would want to make sure that the T4 and the CX6 are on the same PCIe fabric. The most direct way to ascertain this is to check whether they are connected to the same root complex; lspci can show this (see the commands below). The solution might involve moving cards to different slots, or it's possible that there is no solution/fix for a particular server or configuration.
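Something along these lines will show the relationship between the GPU and the NIC (the bus addresses and exact output will of course differ on your systems):

    # tree view of the PCIe topology: check whether the T4 and the CX6
    # sit under the same root port / switch
    lspci -tv

    # NVIDIA's view of the GPU<->NIC path (PIX/PXB/PHB/NODE/SYS)
    nvidia-smi topo -m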

I would also suggest making sure you have the latest SBIOS firmware installed on your Dell platforms.
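To confirm what SBIOS is actually installed, something like the following works; compare the result against the latest version listed on Dell's support page for each platform:

    sudo dmidecode -s bios-version
    sudo dmidecode -s bios-release-date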

Yes, they are on the same NUMA node.

For reference, on the Cascade Lake system, where the reads/writes work, the GPU and the NIC are on different NUMA nodes, so it appears to be a problem specific to the newer CPU architecture or to the server hardware.
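This is how I checked the placement (the PCI addresses below are just examples; substitute the actual bus addresses of the T4 and the CX6 from lspci):

    # NUMA node of each device as reported by sysfs (-1 means no affinity reported)
    cat /sys/bus/pci/devices/0000:17:00.0/numa_node    # T4 (example address)
    cat /sys/bus/pci/devices/0000:98:00.0/numa_node    # ConnectX-6 (example address)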

The systems should already be up to date with the latest drivers and firmware, but I will double-check across the machines, just in case.
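For the double check I'll compare something like the following on the R750 and the R740xd (ofed_info assumes MLNX_OFED is installed; with inbox drivers the tooling differs):

    ofed_info -s                                                  # MLNX_OFED version
    nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver version
    ibv_devinfo | grep fw_ver                                      # RNIC firmware version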