Peer-to-Peer Memory Access can support a system-wide max of 8 peer connections

According to the CUDA documentation:

3.2.6.4. Peer-to-Peer Memory Access

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections.

What does “Each device can support a max of eight peer connections” mean?
Are these 8 simultaneous connections?

From this article,
https://www.servethehome.com/single-root-or-dual-root-for-deep-learning-gpu-to-gpu-systems/

The author gave results for P2P enabled on a 10x GPU single-root system. They ran the p2pBandwidthLatencyTest.

Does this mean that P2P is possible on systems with more than 8 GPUs?

Thanks!
Simon.

Yes, it is 8 simultaneous connections to a particular device. If you make a peer-to-peer association between devices A and B, and then disable that association but enable it between A and C, you can repeat this process for an arbitrarily large number of devices. But at any given moment, A cannot be peer-enabled to more than 8 other devices.
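The enable/disable/re-enable pattern described above can be sketched roughly as follows. This is an untested illustration, not code from the thread; it assumes at least three CUDA devices and abbreviates error checking:

```cpp
// Sketch: device 0 peers with device 1, then swaps that slot to device 2.
// Each cudaDeviceEnablePeerAccess() call consumes one of the current
// device's eight peer-connection slots until it is disabled.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;

    // First verify the hardware/topology allows P2P from device 0 to device 1.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("P2P 0->1 not supported\n"); return 0; }

    cudaSetDevice(0);                  // peer-access calls apply to the current device
    cudaDeviceEnablePeerAccess(1, 0);  // enable 0 -> 1 (flags must be 0)

    // ... do peer-to-peer work between devices 0 and 1 ...

    cudaDeviceDisablePeerAccess(1);    // free one of device 0's eight slots
    cudaDeviceEnablePeerAccess(2, 0);  // reuse that slot for 0 -> 2
    return 0;
}
```

Note that peer access is directional: enabling 0 -> 1 does not enable 1 -> 0; that requires a separate call with device 1 current.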

Hi txbob,

So in that article, the NVIDIA p2pBandwidthLatencyTest which the author ran is only using one GPU pair at a time?

Is there any Nvidia supplied sample that can test 8 simultaneous peer-connected bandwidth and latency?

I haven’t double-checked the source code lately, but I suspect that is what you would see if you looked at it.

I’m not aware of one. Such an app would have several levels of complexity:

  • simultaneous communication could have a lot of possible permutations
  • simultaneous communication will stress different PCIE topologies in different ways, leading to less “predictability” (or probably a better word would be “consistency”) in the results. Of course the results may be predictable given sufficient knowledge of the PCIE topology, but a great many users of this technology don’t really understand PCIE topology ramifications in great depth, so trying to interpret the results might be difficult.

Having said that, the CUDA sample apps first and foremost are designed to be teaching tools, not test or validation utilities (although they obviously serve that purpose to some degree as well). If you wanted to design your own simultaneous communication test app, the p2pBandwidthLatencyTest app should be a pretty good roadmap.
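A skeleton of such a simultaneous test might look something like the following. This is a hypothetical sketch only (not an NVIDIA sample): it enables peer access in a ring, then launches one peer copy per device on independent streams so all transfers are in flight at once. Buffer size and the ring pattern are arbitrary choices, and error checking is omitted:

```cpp
// Hypothetical simultaneous P2P transfer sketch, loosely modeled on the
// structure of p2pBandwidthLatencyTest: one stream per device, all
// device -> next-device copies launched before any synchronization.
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const size_t bytes = 64 << 20;  // 64 MiB per transfer (arbitrary)

    std::vector<void*> buf(n);
    std::vector<cudaStream_t> stream(n);

    for (int d = 0; d < n; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], bytes);
        cudaStreamCreate(&stream[d]);
        // Enable access from device d to every reachable peer
        // (subject to the per-device limit of eight connections).
        for (int p = 0; p < n; ++p) {
            int ok = 0;
            if (p != d && cudaDeviceCanAccessPeer(&ok, d, p) == cudaSuccess && ok)
                cudaDeviceEnablePeerAccess(p, 0);
        }
    }

    // Launch all copies concurrently; a real test would wrap this region
    // in timers to measure aggregate bandwidth under simultaneous load.
    for (int d = 0; d < n; ++d) {
        int dst = (d + 1) % n;
        cudaSetDevice(d);
        cudaMemcpyPeerAsync(buf[dst], dst, buf[d], d, bytes, stream[d]);
    }
    for (int d = 0; d < n; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(stream[d]);
    }
    printf("all simultaneous copies complete\n");
    return 0;
}
```

As txbob notes above, the interesting (and hard) part is choosing which permutations of simultaneous traffic to run, since the results will depend heavily on the PCIE topology.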

Thanks txbob!