Low P2P GPU bandwidth performance between GeForce GPUs

I used GeForce RTX 4090 graphics cards to run the p2pBandwidthLatencyTest CUDA sample from https://github.com/NVIDIA/cuda-samples/, and the result is poor: there are two low values in the matrix.
Why is the P2P GPU bandwidth so low?

4x RTX 4090, 128 GB RAM (4x 32 GB DDR5-4000)

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 919.39  **17.79**  31.04  31.03
     1  **28.77** 923.74  31.11  31.02
     2  31.17  31.22 923.19  31.25
     3  31.18  31.19  31.31 923.46

After removing 2 devices, the performance reaches the theoretical value.
2x RTX 4090, 128 GB RAM (4x 32 GB DDR5-4000)

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.58  31.24
     1  31.16 923.46

I tried increasing the RAM capacity, but the result was the same as before.
4x RTX 4090, 256 GB RAM (4x 64 GB DDR5-4000)

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 918.31  **20.51**  30.13  28.48 
     1  **23.52** 922.92  29.83  30.39 
     2  28.36  29.70 923.64  29.69 
     3  30.93  30.18  29.42 923.46 

How can I increase the RTX 4090’s P2P performance?
Are there any requirements for this CUDA sample?
I look forward to your reply ^^

What does the system look like? What type of CPU is used? Single socket, dual socket?

Are there four PCIe 4.0 x16 slots for the four GPUs to plug into? Does the CPU provide > 64 PCIe 4.0 lanes?

P2P is not supported on Ada desktop cards, see here.

While true P2P is not possible, there is a fallback mode where communication goes via the host and PCIe. I interpreted the question as asking why not all GPU pairs in this system have the same communication throughput via PCIe when there are four GPUs.
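
For what it’s worth, here is a rough sketch (my own simplification, not the actual sample code) of the kind of measurement behind the “Bidirectional P2P=Disabled” matrix: copies in both directions run concurrently, and with peer access left disabled each transfer is staged through host memory over PCIe, which is why the host side of the topology matters.

```cpp
// Simplified sketch (not the actual sample code): time concurrent copies in
// both directions between device 0 and device 1 while peer access stays
// disabled, so the runtime stages each transfer through host memory over PCIe.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int dev0 = 0, dev1 = 1;
    const size_t bytes = 256ull << 20;  // 256 MiB per direction

    void *buf0 = nullptr, *buf1 = nullptr;
    cudaStream_t s0, s1;
    cudaSetDevice(dev0); cudaMalloc(&buf0, bytes); cudaStreamCreate(&s0);
    cudaSetDevice(dev1); cudaMalloc(&buf1, bytes); cudaStreamCreate(&s1);

    auto t0 = std::chrono::steady_clock::now();
    // Launch both directions at the same time ("bidirectional").
    cudaMemcpyPeerAsync(buf1, dev1, buf0, dev0, bytes, s0);
    cudaMemcpyPeerAsync(buf0, dev0, buf1, dev1, bytes, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    printf("approx. bidirectional throughput: %.1f GB/s\n",
           2.0 * bytes / sec / 1e9);

    cudaFree(buf0); cudaFree(buf1);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    return 0;
}
```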

Fair point, and that does seem to be the gist of the OP’s question.

One observation: if “Bidirectional” in the results means full duplex, the 31 GB/s figure would seem to indicate PCIe Gen3 performance?

Or gen4 x8.

There is a project here adding P2P to 4090, which may be of interest.

4U system (up to 8 GPUs). 2x Intel Xeon Silver 4410Y. Dual-socket 4th Gen Xeon motherboard.

Yes, there are. Each processor provides 80 PCIe 4.0 lanes, so there are 160 lanes in total.

Got it! I understand… true P2P may require NVLink or NVSwitch to support it.
P2P for PCIe devices has been around for many years and has different interpretations. I think of it as the difference between P2P Access (direct!) and P2P Copy (via the host and PCIe).
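
In CUDA API terms, a rough way to picture that distinction (a simplified sketch, not a definition from the documentation): P2P Access means a kernel on one GPU can directly dereference memory that lives on another GPU once cudaDeviceEnablePeerAccess() succeeds, while a P2P Copy via cudaMemcpyPeer() also works when peer access is unavailable, in which case the runtime stages the transfer through host memory.

```cpp
// Hedged sketch of the two notions of "P2P" between device 0 and device 1.
#include <cstdio>
#include <cuda_runtime.h>

// Kernel on device 0 that directly dereferences a pointer allocated on
// device 1 -- only legal once peer *access* has been enabled.
__global__ void readRemote(const float* remote, float* local, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local[i] = remote[i];
}

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(float);
    float *d0, *d1;

    cudaSetDevice(1); cudaMalloc(&d1, bytes);
    cudaSetDevice(0); cudaMalloc(&d0, bytes);

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        // P2P Access: GPU 0 maps GPU 1's memory and reads it directly.
        cudaDeviceEnablePeerAccess(1, 0);
        readRemote<<<(n + 255) / 256, 256>>>(d1, d0, n);
        cudaDeviceSynchronize();
        printf("direct peer access used\n");
    } else {
        // P2P Copy: still works, but the runtime stages the transfer
        // through host memory over PCIe (the fallback discussed above).
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);
        printf("fallback copy through the host used\n");
    }
    return 0;
}
```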

I think… it’s Gen4 x16! (quick check below)
Gen4 x8 → 15.8 GB/s
Gen4 x16 → 31.5 GB/s
from https://en.wikipedia.org/wiki/PCI_Express
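
For completeness, those numbers follow from the per-lane rate and the 128b/130b encoding; a quick back-of-the-envelope check (raw link bandwidth only, ignoring packet/protocol overhead):

```cpp
// Back-of-the-envelope PCIe Gen4 raw bandwidth, ignoring protocol overhead.
#include <cstdio>

int main() {
    const double gtPerSec = 16.0;            // PCIe Gen4: 16 GT/s per lane
    const double encoding = 128.0 / 130.0;   // 128b/130b line code
    const int laneCounts[] = {8, 16};
    for (int lanes : laneCounts) {
        double gbPerSec = gtPerSec * encoding / 8.0 * lanes;
        printf("Gen4 x%-2d -> %.1f GB/s per direction\n", lanes, gbPerSec);
    }
    return 0;
}
```

That gives about 15.8 GB/s for x8 and 31.5 GB/s for x16, per direction and before packet overhead.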

Sorry, you are right regarding the PCIe speed per lane.

The slow speed seems to appear only between devices 0 and 1. Have you removed devices 2 and 3 in your second test with two GPUs?

You could change the source code of the bandwidth test so that, with 4 GPUs installed, only 2 are exercised at a time, to see whether some link is over its capacity or whether installing that many GPUs leads to some reconfiguration.

Which CPU’s lanes are the four cards assigned to?

Have you tried affinity settings to let the benchmark run on the one or the other CPU?
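
If it helps, here is a small helper I would use (my own sketch, not part of the sample) to print each device’s PCI bus ID; matching those IDs against nvidia-smi topo -m or the numa_node files under /sys/bus/pci/devices/ should show which socket’s lanes each card is attached to.

```cpp
// Print each CUDA device's name and PCI bus ID so it can be matched against
// the host's PCIe/NUMA topology (e.g. nvidia-smi topo -m).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d: %s @ %s\n", dev, prop.name, busId);
    }
    return 0;
}
```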

If you look at the chart you link, Note i at the bottom states, “In each direction”, so if the test is bidirectional or full duplex, the throughput should be twice this.

The same test run on 4090s with the P2P-enabled driver referred to above shows a throughput of 50 GB/s.

PCIe uses packetized transport. There are discrepancies between theoretical PCIe throughput and what is practically achievable at the supported packet size (I think 256 bytes these days, but don’t quote me on that). What I have seen in practice for unidirectional traffic is

- between 12 GB/sec and 13 GB/sec for a PCIe 3.0 x16 interface
- around 25 GB/sec for a PCIe 4.0 x16 interface

To my knowledge there are no GPUs with a PCIe 5.0 interface yet. So if 50 GB/sec are reported for the RTX 4090, it stands to reason that this refers to bidirectional bandwidth, given that PCIe is a full-duplex interconnect.

As has been alluded to in posts by other participants, for best performance in a dual-socket system it is important to have each GPU “talk” to the “near” CPU and its associated memory; otherwise inter-socket communication can become a bottleneck. For this, specify processor and memory affinity with a utility like numactl, or use any controls provided by the test app itself (I have not looked at what it offers).