I’m looking to deploy 1x L4 GPU and 1x ConnectX-6 DX NIC in nodes based on this motherboard:
The platform uses a 4th Gen Intel Xeon Scalable Processor (4410Y). The motherboard data sheet linked above shows the x16 slots’ PCIe lanes connect directly to the processor.
I am trying to understand whether this system will support high-performance GPUDirect RDMA between the NIC and GPU. The supported systems section of this page identifies potential issues with topologies that do not use a PCIe switch, and I believe this system would be categorised as “single CPU/IOH”.
Can anyone confirm whether high-performance GPUDirect RDMA is achievable on this platform?
I just realised that I left out a link from my original post. My question is whether the issues described on this page are still relevant in 2024 and will cause problems with a 4th Gen Xeon Scalable CPU:
The “Supported Systems” section of that page states:
Even though the only theoretical requirement for GPUDirect RDMA to work between a third-party device and an NVIDIA GPU is that they share the same root complex, there exist bugs (mostly in chipsets) causing it to perform badly, or not work at all in certain setups.
We can distinguish between three situations, depending on what is on the path between the GPU and the third-party device:
PCIe switches only
single CPU/IOH
CPU/IOH ↔ QPI/HT ↔ CPU/IOH
The first situation, where there are only PCIe switches on the path, is optimal and yields the best performance. The second one, where a single CPU/IOH is involved, works, but yields worse performance (especially peer-to-peer read bandwidth has been shown to be severely limited on some processor architectures). Finally, the third situation, where the path traverses a QPI/HT link, may be extremely performance-limited or even not work reliably.
As the PCIe lanes for all slots on the motherboard I’m looking to use go directly to the 4410Y Xeon (i.e. the motherboard contains no PCIe switches), I think this corresponds to the “single CPU/IOH” scenario described in the NVIDIA GPUDirect RDMA documentation quoted above. What is not clear is whether modern Intel CPUs still suffer from poor peer-to-peer performance or whether this is a historic issue.
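If it helps, here is a minimal sketch (my own assumption, not taken from the GPUDirect RDMA docs) of how I’d plan to confirm which of the three situations applies once the node is built, by shelling out to `nvidia-smi topo -m` from Python; the legend abbreviations in the comments are the ones nvidia-smi itself reports:

```python
#!/usr/bin/env python3
"""Minimal sketch: report how the GPU and the ConnectX NIC are connected.

Assumes a Linux node with the NVIDIA driver installed (for `nvidia-smi`) and
the NIC driver loaded so that the NIC appears in the topology matrix.
"""
import subprocess

def print_pcie_topology() -> None:
    # `nvidia-smi topo -m` prints a matrix of link types between GPUs and NICs.
    # The legend maps onto the three situations quoted above:
    #   PIX / PXB  - path crosses PCIe switches only (the optimal case)
    #   PHB        - path crosses the CPU's PCIe host bridge ("single CPU/IOH")
    #   NODE / SYS - path crosses between host bridges or the CPU interconnect
    result = subprocess.run(["nvidia-smi", "topo", "-m"],
                            capture_output=True, text=True, check=True)
    print(result.stdout)

if __name__ == "__main__":
    print_pcie_topology()
```

Given that every slot on this board wires straight to the CPU, I would expect the GPU ↔ NIC entry to come back as PHB, i.e. the single CPU/IOH case.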
NVIDIA doesn’t support or recommend inserting a data center GPU such as the L4 into an “arbitrary” motherboard. There are a variety of reasons for this. One of them is that the L4 GPU requires server-managed flow-through cooling, which requires proper integration with the server and corresponding control code in the server BMC to manage the cooling operation. That is just one of several reasons.
Supported data center GPU configurations can be found here. That doesn’t say anything per se about GPUDirect RDMA, which also involves a networking adapter. To get that level of support/certification, you would want to choose an NVIDIA-Certified system. Basically, this means looking at the column on the right for “NVIDIA-Certified” instead of just “Qualified” (you could also use the filtering options on that page to select the “Datacenter” certification type). A properly configured certified system from one of those vendors should be able to support GPUDirect RDMA between the NIC installed by the vendor and the L4 GPU installed by the vendor.
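As a rough sanity check on whatever system you end up with (a sketch only, not part of the certification process; it assumes a Linux install with the NVIDIA driver, MLNX_OFED/DOCA, and the perftest suite built with CUDA support), you can confirm the GPUDirect RDMA prerequisites are in place before measuring bandwidth:

```python
#!/usr/bin/env python3
"""Sketch: check that the kernel module needed for GPUDirect RDMA is loaded.

Assumption: a current NVIDIA driver exposing the nvidia_peermem module
(older installs used nv_peer_mem); nothing here is specific to this thread.
"""
import subprocess

def peermem_loaded() -> bool:
    # The NIC can only register GPU memory for RDMA if the peer-memory
    # module is loaded; otherwise traffic is staged through host memory.
    lsmod = subprocess.run(["lsmod"], capture_output=True, text=True, check=True)
    return any(line.split()[0] in ("nvidia_peermem", "nv_peer_mem")
               for line in lsmod.stdout.splitlines() if line.strip())

if __name__ == "__main__":
    if peermem_loaded():
        print("peer-memory module loaded: GPU memory can be registered by the NIC")
    else:
        print("peer-memory module missing: GPUDirect RDMA will not be used")
    # Achieved bandwidth can then be measured between two nodes with the
    # perftest tools, e.g. `ib_write_bw --use_cuda=0 <server>` on the client
    # side (the --use_cuda flag is available when perftest is built with CUDA).
```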