I used GeForce RTX 4090 graphics cards to run the CUDA samples released at https://github.com/NVIDIA/cuda-samples/. The p2pBandwidthLatencyTest result is bad: there are two low values.
Why is P2P GPU bandwidth performance low?
While true P2P is not possible, there is a fallback mode where communication goes via the host over PCIe. I interpreted the question as asking why not all GPU pairs in this system have the same communication throughput over PCIe when four GPUs are installed.
Got it! I understand… true P2P may require NVLink or NVSwitch to support it.
P2P for PCIe devices has been around for many years and has picked up different interpretations. The way I see it, it’s the difference between P2P Access (direct!) and P2P Copy (via the host and PCIe).
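To make the distinction concrete, here is a minimal CUDA sketch (device IDs 0 and 1 are placeholders, error checking omitted): cudaDeviceCanAccessPeer reports whether direct P2P Access is possible, while cudaMemcpyPeer works either way, falling back to staging through host memory over PCIe when direct access is unavailable.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0, peer = 1;  // placeholder device IDs

    // P2P Access: can 'dev' map and directly read/write 'peer' memory?
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, dev, peer);
    printf("GPU %d -> GPU %d direct P2P access: %s\n",
           dev, peer, canAccess ? "yes" : "no");

    cudaSetDevice(dev);
    if (canAccess)
        cudaDeviceEnablePeerAccess(peer, 0);  // enable direct access / peer DMA

    // P2P Copy: works in both cases; without direct access the runtime
    // stages the transfer through host memory over PCIe.
    size_t bytes = 64u << 20;
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(peer); cudaMalloc(&src, bytes);
    cudaSetDevice(dev);  cudaMalloc(&dst, bytes);
    cudaMemcpyPeer(dst, dev, src, peer, bytes);
    cudaDeviceSynchronize();
    return 0;
}
```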
Sorry, you are right regarding the PCIe speed per lane.
The slow speed seems to appear only between devices 0 and 1. Have you removed devices 2 and 3 in your second test with two GPUs?
You could change the source code of the bandwidth test so that, with 4 GPUs installed, it only exercises 2 at a time, to see whether some link is over its capacity or whether installing that many GPUs leads to some reconfiguration.
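If you would rather not patch the sample, restricting enumeration with CUDA_VISIBLE_DEVICES (e.g. to two of the four cards) achieves much the same thing. Alternatively, a small standalone check along these lines, run one pair at a time with all four cards installed, should show whether an upstream link is being shared. This is only a sketch (device IDs from the command line, error checking omitted), not the sample’s code:

```cpp
// Minimal pairwise copy-bandwidth check. Usage: ./pair_bw <srcDev> <dstDev>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    int src = (argc > 1) ? atoi(argv[1]) : 0;
    int dst = (argc > 2) ? atoi(argv[2]) : 1;
    const size_t bytes = 256ull << 20;  // 256 MiB per transfer
    const int reps = 20;

    void *s = nullptr, *d = nullptr;
    cudaSetDevice(src); cudaMalloc(&s, bytes);
    cudaSetDevice(dst); cudaMalloc(&d, bytes);

    // Time 'reps' back-to-back peer copies on the current device's stream.
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpyPeerAsync(d, dst, s, src, bytes, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * reps / (ms * 1e-3) / 1e9;
    printf("GPU %d -> GPU %d: %.1f GB/s\n", src, dst, gbps);
    return 0;
}
```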
Which CPU’s lanes are the four cards assigned to?
Have you tried affinity settings to let the benchmark run on the one or the other CPU?
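To answer the lane-assignment question, one way is to print each card’s PCI bus ID and match it against the root ports of the two sockets (via lspci or the board’s block diagram); nvidia-smi topo -m should also show the CPU affinity of each GPU directly. A quick sketch:

```cpp
// Print the PCI domain:bus:device ID of every visible GPU so the cards
// can be matched against the PCIe root ports of each CPU socket.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        char busId[32];
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("GPU %d (%s): %s\n", dev, prop.name, busId);
    }
    return 0;
}
```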
If you look at the chart you linked, Note i at the bottom states, “In each direction”, so if the test is bidirectional (full duplex), the throughput should be twice this.
The same test, run on 4090s with the P2P-enabled driver referred to above, shows a throughput of 50 GB/s.
PCIe uses packetized transport. There are discrepancies between theoretical PCIe throughput and what is practically achievable at the supported packet size (I think 256 bytes these days, but don’t quote me on that). What I have seen in practice for unidirectional traffic is:
Between 12 GB/sec and 13 GB/sec for a PCIe 3.0 x16 interface
Around 25 GB/sec for a PCIe 4.0 x16 interface
To my knowledge there are no GPUs with a PCIe 5.0 interface yet. So if 50 GB/sec are reported for the RTX 4090, it stands to reason that this refers to bidirectional bandwidth, given that PCIe is a full-duplex interconnect.
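For reference, the raw link-rate arithmetic behind those numbers (each transfer carries 1 bit per lane, 128b/130b encoding; the remaining gap to the measured figures is packet/protocol overhead):

```latex
\[
  \text{PCIe 3.0 x16:}\quad 8\,\mathrm{GT/s} \times 16\ \text{lanes} \times \tfrac{128}{130} \div 8 \approx 15.8\ \mathrm{GB/s\ per\ direction}
\]
\[
  \text{PCIe 4.0 x16:}\quad 16\,\mathrm{GT/s} \times 16\ \text{lanes} \times \tfrac{128}{130} \div 8 \approx 31.5\ \mathrm{GB/s\ per\ direction}
\]
```

So roughly 25 out of ~31.5 GB/sec per direction is what a Gen4 x16 link delivers in practice, and ~50 GB/sec lines up with summing both directions of the full-duplex link.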
As has been alluded to in posts by other participants, for best performance in a dual-socket system it is important to have each GPU “talk” to the “near” CPU and its associated memory, otherwise inter-socket communication can become a bottleneck. For this, specify processor and memory affinity with a utility like numactl, or use any controls provided by the test app itself (I have not looked at what it offers).