I’m looking to deploy 1x L4 GPU and 1x ConnectX-6 DX NIC in nodes based on this motherboard:
The platform uses a 4th Gen Intel Xeon Scalable Processor (4410Y). The motherboard data sheet linked above shows the x16 slots’ PCIe lanes connect directly to the processor.
I am trying to understand whether this system will support high-performance GPUDirect RDMA between the NIC and GPU. The supported systems section of this page identifies potential issues with topologies that do not use a PCIe switch, and I believe this system would be categorised as “single CPU/IOH”.
Can anyone confirm whether high-performance GPUDirect RDMA is achievable on this platform?
I just realised that I left out a link from my original post. My question is whether the issues described on this page are still relevant in 2024 and will cause problems with a 4th Gen Xeon Scalable CPU:
The “Supported Systems” section of that page states:
Even though the only theoretical requirement for GPUDirect RDMA to work between a third-party device and an NVIDIA GPU is that they share the same root complex, there exist bugs (mostly in chipsets) causing it to perform badly, or not work at all in certain setups.
We can distinguish between three situations, depending on what is on the path between the GPU and the third-party device:
PCIe switches only
single CPU/IOH
CPU/IOH ↔ QPI/HT ↔ CPU/IOH
The first situation, where there are only PCIe switches on the path, is optimal and yields the best performance. The second one, where a single CPU/IOH is involved, works, but yields worse performance (especially peer-to-peer read bandwidth has been shown to be severely limited on some processor architectures). Finally, the third situation, where the path traverses a QPI/HT link, may be extremely performance-limited or even not work reliably.
As the PCIe lanes for all slots on the motherboard I’m looking to use go directly to the 4410Y Xeon (i.e. the motherboard contains no PCIe switches), I think this corresponds to the “single CPU/IOH” scenario described in the NVIDIA GPUDirect RDMA documentation quoted above. What is not clear is whether modern Intel CPUs still suffer from poor peer-to-peer performance or whether this is a historic issue.
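If it helps, here is a minimal sketch (my own assumption, not taken from the GPUDirect RDMA docs) of how I’d plan to confirm which of the three situations applies once the node is built, by shelling out to `nvidia-smi topo -m` from Python; the legend abbreviations in the comments are the ones nvidia-smi itself reports:

```python
#!/usr/bin/env python3
"""Minimal sketch: report how the GPU and the ConnectX NIC are connected.

Assumes a Linux node with the NVIDIA driver installed (for `nvidia-smi`) and
the NIC driver loaded so that the NIC appears in the topology matrix.
"""
import subprocess

def print_pcie_topology() -> None:
    # `nvidia-smi topo -m` prints a matrix of link types between GPUs and NICs.
    # The legend maps onto the three situations quoted above:
    #   PIX / PXB  - path crosses PCIe switches only (the optimal case)
    #   PHB        - path crosses the CPU's PCIe host bridge ("single CPU/IOH")
    #   NODE / SYS - path crosses between host bridges or the CPU interconnect
    result = subprocess.run(["nvidia-smi", "topo", "-m"],
                            capture_output=True, text=True, check=True)
    print(result.stdout)

if __name__ == "__main__":
    print_pcie_topology()
```

Given that every slot on this board wires straight to the CPU, I would expect the GPU ↔ NIC entry to come back as PHB, i.e. the single CPU/IOH case.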
NVIDIA doesn’t support or recommend inserting a data center GPU such as the L4 into an “arbitrary” motherboard. There are a variety of reasons for this. One of them is that the L4 GPU requires server-managed flow-through cooling, which requires proper integration with the server and corresponding control code in the server BMC to manage the cooling operation. That is just one of several reasons.
Supported data center GPU configurations can be found here. That doesn’t say anything per se about GPUDirect RDMA, which also involves a networking adapter. To get that level of support/certification, you would want to choose an NVIDIA-Certified system. Basically, this means looking at the column on the right for “NVIDIA-Certified” instead of just “Qualified” (you could also use the filtering options on that page to select the “Datacenter” certification type). A properly configured certified system from one of those vendors should be able to support GPUDirect RDMA between the NIC installed by the vendor and the L4 GPU installed by the vendor.
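As a rough sanity check on whatever system you end up with (a sketch only, not part of the certification process; it assumes a Linux install with the NVIDIA driver, MLNX_OFED/DOCA, and the perftest suite built with CUDA support), you can confirm the GPUDirect RDMA prerequisites are in place before measuring bandwidth:

```python
#!/usr/bin/env python3
"""Sketch: check that the kernel module needed for GPUDirect RDMA is loaded.

Assumption: a current NVIDIA driver exposing the nvidia_peermem module
(older installs used nv_peer_mem); nothing here is specific to this thread.
"""
import subprocess

def peermem_loaded() -> bool:
    # The NIC can only register GPU memory for RDMA if the peer-memory
    # module is loaded; otherwise traffic is staged through host memory.
    lsmod = subprocess.run(["lsmod"], capture_output=True, text=True, check=True)
    return any(line.split()[0] in ("nvidia_peermem", "nv_peer_mem")
               for line in lsmod.stdout.splitlines() if line.strip())

if __name__ == "__main__":
    if peermem_loaded():
        print("peer-memory module loaded: GPU memory can be registered by the NIC")
    else:
        print("peer-memory module missing: GPUDirect RDMA will not be used")
    # Achieved bandwidth can then be measured between two nodes with the
    # perftest tools, e.g. `ib_write_bw --use_cuda=0 <server>` on the client
    # side (the --use_cuda flag is available when perftest is built with CUDA).
```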