PCIe5 P2P GPU via NICs faster than PCIe switch?

Hi,

Just a curiosity I noticed while testing ConnectX-7 P2P performance with RTX PRO 6000 Blackwell server cards: I get ~45 GB/s P2P via the NICs, but only ~40 GB/s between GPUs on the same PCIe switch.

Is there something about GPU-to-NIC transfers that is optimised for P2P compared to direct GPU-to-GPU over PCIe?

I mean, in the GPU-to-NIC case we are doing so much more work, going GPU → NIC → optics → NIC → GPU, compared to just GPU → PCIe switch → GPU.

But NCCL throughput is still higher in the NIC case, even if the latency is slightly worse.

It just seemed a little odd to me, considering that GPU-to-NIC and GPU-to-GPU performance should be the same, since they all share the same PCIe switch.

The specifications for the ConnectX-7 state that it sports dual 200 Gbit/s ports and that it connects to the host via PCIe5. Based on that, an observed throughput of 45 GB/sec seems impossible, unless (1) some form of online compression is involved, or (2) the throughput measurement methodology is flawed.

Hypothesis (1) could be refuted by transferring data generated with a high-quality PRNG, which makes the data practically incompressible. To refute (2), one would need to know exactly how you are measuring.
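To illustrate the incompressibility point, here is a quick sketch (my own illustration, not part of any particular measurement setup): high-entropy random data does not compress, so a transfer benchmark fed with such data rules out on-the-wire compression as the source of extra throughput.

```python
import os
import zlib

# Fill a buffer with high-entropy (practically incompressible) data.
buf = os.urandom(1 << 20)  # 1 MiB from the OS CSPRNG

# zlib at maximum effort cannot shrink it; the "compressed" form is
# typically slightly LARGER than the input due to framing overhead.
compressed = zlib.compress(buf, level=9)
ratio = len(compressed) / len(buf)
print(f"compression ratio: {ratio:.4f}")  # just over 1.0
```

A benchmark that shows the same GB/s with this buffer as with zero-filled buffers is not benefiting from compression.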

Is the PCIe switch used operating in PCIe5 mode?

There is a 400 Gb/s version of the ConnectX-7. Wouldn't 45 GB/s be about the real-world maximum on a single-port card?

Thanks for the pointer. Apparently I was looking at an older datasheet. I don't have experience with 400 Gb/sec interconnects, but from my time in networking I retained a rule of thumb that would suggest 40 GB/sec as the practically achievable throughput for such an interconnect. It is entirely possible that modern interconnects are more efficient, though, so 45 GB/sec is plausible when only a single port is in use, as the total bandwidth of the adapter is 400 Gb/sec per the datasheet.
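The back-of-envelope arithmetic behind that (the 5-10% protocol-overhead range is my assumption; the exact figure depends on MTU and transport headers):

```python
# Convert a 400 Gb/s NIC line rate to usable GB/s (decimal units,
# as NIC vendors quote them).
line_rate_gbps = 400
raw_gbytes = line_rate_gbps / 8  # 50.0 GB/s on the wire

# Assume roughly 5-10% protocol overhead (Ethernet/RoCE framing,
# flow control). This range is an assumption, not a measured value.
eff_low = raw_gbytes * 0.90   # ~45 GB/s
eff_high = raw_gbytes * 0.95  # ~47.5 GB/s
print(f"raw: {raw_gbytes} GB/s, effective: {eff_low:.1f}-{eff_high:.1f} GB/s")
```

On those assumptions, an observed 45 GB/sec sits right at the low end of what a single 400 Gb/sec port could deliver.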

With the NIC side plausibility-checked, it seems one would want to take a closer look at the PCIe switch side. Based on raw PCIe5 throughput, I would expect between 50 and 55 GB/sec unidirectional, rather than the 40 GB/sec that was observed. Absent knowledge of the PCIe5 switch's specifications, we cannot determine whether this is a fundamental throughput limit of the switch deployed, a configuration issue, or something else.
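For reference, here is how the 50-55 GB/sec estimate falls out of the PCIe5 numbers, assuming a x16 link at 32 GT/s per lane with 128b/130b encoding; the 80-88% packetization-efficiency range is an assumption that varies with the configured maximum payload size:

```python
# PCIe 5.0 x16 unidirectional bandwidth estimate.
gt_per_lane = 32       # GT/s per lane at Gen5
lanes = 16
encoding = 128 / 130   # 128b/130b line encoding

raw = gt_per_lane * lanes * encoding / 8  # ~63 GB/s after encoding

# TLP/DLLP packetization overhead: with a typical 256-byte max payload,
# roughly 80-88% of the post-encoding rate carries actual data
# (assumption; depends on header size and max payload size).
usable_low = raw * 0.80
usable_high = raw * 0.88
print(f"raw: {raw:.1f} GB/s, usable: {usable_low:.1f}-{usable_high:.1f} GB/s")
```

That lands in the 50-55 GB/sec range, which is why the observed 40 GB/sec through the switch looks low.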

I wonder whether there is a better sub-forum for this issue than "CUDA Programming and Performance", as we normally deal with software issues here, not physical interconnects. Maybe this:

Yes, I am testing with 2 × 400 Gbit in a single host (directly connected NICs).

The GPUs traverse the PCIe switch whether they talk to each other or to the NICs. So I would have thought PCIe switch performance would affect both equally, and that GPU-to-GPU performance would beat GPU → NIC → NIC → GPU.

But all the NCCL tests, like all-reduce, show that the NIC route is consistently ~5 GB/s higher in throughput (though, as expected, with slightly higher latency).