How can I tell which NVIDIA GPUs will have P2P access to the same GPU on PCIe?

Some NVIDIA GPUs are able to access each other’s memory, at least when they’re on the same PCIe root hub. Unfortunately, not all pairs of GPUs can. I know that, typically, mixing architecture generations (e.g. Pascal vs. Ampere) can prevent such P2P access; but what if you have two cards of the exact same model? Or rather: how can I tell, without getting the cards and trying, which pairs of GPUs of the same kind will have P2P access to each other? The features table by compute capability doesn’t have this information.

The programming guide says that

Depending on the system properties, specifically the PCIe and/or NVLINK topology, devices are able to address each other’s memory … peer-to-peer memory access

But which devices (if not pairs of devices) offer it?


Hi,

A quote from Robert here:

" Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support are the tools provided that query the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup."

I know it varies, that’s why I asked my question. Which GPUs support P2P when installed in pairs?

Also, how does the system have anything to do with it? i.e. what in my system can preclude P2P support?

I don’t know of a way. Therefore what follows is not an answer to your question. It’s merely some “community” commentary, which may or may not be useful.

There isn’t any list published anywhere that I am familiar with, nor any table of characteristics or documentation from which such a list could be assembled. Furthermore, there are no static, electronically readable characteristics (such as cudaDeviceProp) that pertain to this. The only electronic method is cudaDeviceCanAccessPeer(). If you have, for example, two GPUs of type X, and you install them in a system, and you observe a positive result from cudaDeviceCanAccessPeer() between those two GPUs, then you can be sure that X is on the list of what is theoretically possible.

However, the reverse is not true: just because you observed a false result from cudaDeviceCanAccessPeer() does not mean categorically that a GPU of type X is not on “the list”. The presence or absence of NVLINK bridges also impacts “the list”. As far as this particular sub-topic goes (“the list”), you’re welcome to file a bug requesting documentation, if you wish. But, as we shall see, even if a GPU type X were on an imagined published list, that does not guarantee that if you buy two GPUs of type X, you will witness P2P.
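To make the query concrete, here is a minimal sketch (my own illustration, not an NVIDIA sample, though deviceQuery performs a similar check) that asks the runtime about every ordered pair of GPUs in the system:

```cpp
// Minimal sketch: query cudaDeviceCanAccessPeer() for every ordered pair
// of GPUs. A positive result means the runtime believes device i can
// directly address device j's memory in this particular system.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j,
                   canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}
```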

The system type does not have anything to do with whether or not a GPU of type X is theoretically capable of P2P, as defined/characterized above (i.e. on “the list”). However the system type does matter in order to actually establish/witness P2P, which is surely what people actually care about (my opinion, others may disagree).

To give an example (I have no intention of providing an exhaustive decoder ring in this answer):

Some systems have two CPU sockets, and therefore may have multiple PCIE root complexes, and multiple PCIE sockets, some of which belong to a root complex associated with the first CPU socket, and some of which belong to a root complex associated with the second CPU socket.

For some of the systems so described, P2P might not work properly if a GPU of type X is installed in a PCIE socket associated with the first CPU socket, and another GPU of type X is installed in a PCIE socket associated with the second CPU socket, even if, in another system type, two GPUs of type X were found to report true for the cudaDeviceCanAccessPeer() inquiry. The observations vary here: some systems with two sockets may not provide P2P at all between the two sockets, some may provide P2P at approximately “full” speed, and some may provide P2P but at reduced speed between the two sockets, as compared to the case where both GPUs are on the same PCIE fabric (i.e. attached to the same root complex). Furthermore, GPUs on the same fabric that communicate directly with the root complex may experience lower P2P throughput than GPUs on the same fabric that both hang off the same interposing PCIE bridge. That is a fairly common observation.
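If you want to see where each GPU sits, the runtime exposes each device’s PCI address via cudaDeviceProp. A small sketch follows; mapping those addresses back to CPU sockets and root complexes is system-specific (tools such as lspci -t or nvidia-smi topo -m can help with that part):

```cpp
// Sketch: print each GPU's PCI domain/bus/device IDs. Comparing these
// against the host's PCIe topology can reveal whether two GPUs share a
// root complex or hang off the same bridge.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d (%s): PCI %04x:%02x:%02x\n", i, prop.name,
               prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```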

The final (and only) arbiter of this is cudaDeviceCanAccessPeer. The topic is complex, compounded by a number of factors. I don’t have the knowledge to list every factor which may impact P2P.

To give some additional anecdotal “color”:
When the CUDA developers and driver engineers are testing out and characterizing a new CPU or a new set of core logic, they often find that certain scenarios do not work correctly, and they work with other suppliers to try to resolve those technical issues. Historically, it has sometimes been the case that a particular scenario initially does not work, so cudaDeviceCanAccessPeer() reports false, and then later, with additional debug work and perhaps additional system BIOS development work, the scenario is made to work correctly. In a subsequent GPU driver and CUDA version (and perhaps a system BIOS update), it is then possible that the previously non-working scenario is reported as working.

As another example of possible system dependency, it is documented that certain systems that use PCIE bridges may require specific settings on those PCIE bridges, in order for P2P to work correctly.

So let me say it again: the final (and only) arbiter in any setting is cudaDeviceCanAccessPeer().
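For completeness, here is a sketch of what acting on a positive query typically looks like: enable peer access in each direction, then perform a direct device-to-device copy. This is a minimal illustration assuming a two-GPU system, not a robust implementation; real code should check the return value of every CUDA call.

```cpp
// Sketch: act on a positive cudaDeviceCanAccessPeer() result.
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) return 1;       // the arbiter says no

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);     // flags argument must be 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    const size_t bytes = 1 << 20;
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // With peer access enabled, this copy can go directly over the
    // PCIe/NVLINK fabric instead of staging through host memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```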

So you ask, “how am I to do my shopping?” For new system purchases, this topic (if the capability is desired) should be brought to the attention of the system provider, and made a condition of the purchase/sale. It can then be tested as part of system commissioning, using utilities such as deviceQuery and p2pBandwidthLatencyTest from the CUDA samples. For systems that come pre-assembled from the system provider with NVLINK bridges installed, and/or DGX systems, and/or systems based on HGX technology, it’s reasonable to assume that those systems should offer P2P between bridged GPUs or between GPUs with NVLINK connectivity (although the exact details may vary: DGX-1 and DGX-1V have a ring-like fabric, the so-called “hypercube mesh”, which only offered NVLINK P2P between directly connected GPUs; systems like DGX-A100 and DGX-H100 have NVLINK switch fabrics, which offer NVLINK P2P between any two GPUs in the same system, if enabled).
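As a rough illustration of the kind of measurement p2pBandwidthLatencyTest makes (this sketch is not a substitute for the actual sample), you can time a repeated peer copy with CUDA events and compare the result against expectations for your link generation/width, and against the topology considerations above:

```cpp
// Rough sketch: measure P2P copy bandwidth from GPU 0 to GPU 1.
// Assumes cudaDeviceCanAccessPeer() has already reported true both ways.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;    // 256 MiB per transfer
    const int reps = 10;
    void *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // total bytes moved / elapsed time, reported in GB/s
    printf("~%.1f GB/s\n", (double)bytes * reps / (ms * 1e6));
    return 0;
}
```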

For other purchases (“I bought two RTX 3090s”, or “I had this server sitting around, and I bought two RTX A5000s to put in it”), you must rely on word of mouth, or a survey of reports on forums, to make a best-guess determination.

It is not guaranteed. Your only recourse is to return the GPUs you bought. NVIDIA doesn’t guarantee to make P2P work in any setting you may choose.

For future readers, nothing I have written here should be construed as an affirmed statement that P2P will work in your particular setup. The final and only arbiter is the result of cudaDeviceCanAccessPeer(), not anything here.

For some additional color, P2P can be “broken” even when cudaDeviceCanAccessPeer() reports “true”. I’m not suggesting that I know that the SLI statement in that thread is accurate, but it is certainly the case that someone reported P2P trouble in a system and then later reported P2P success in the same system, after fiddling with some aspect of system configuration.
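One way to gain confidence beyond the bare query result is a functional check: push a known pattern through the P2P path and verify it on the other side. A sketch, again assuming a two-GPU system where the query reported true; note the explicit checks on cudaDeviceEnablePeerAccess(), since cudaMemcpyPeer() would otherwise silently fall back to staging through the host:

```cpp
// Sketch: verify that a peer copy actually delivers correct data.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;
    void *a = nullptr, *b = nullptr;

    cudaSetDevice(0);
    if (cudaDeviceEnablePeerAccess(1, 0) != cudaSuccess) return 1;
    cudaMalloc(&a, bytes);
    cudaMemset(a, 0xAB, bytes);           // known pattern on GPU 0

    cudaSetDevice(1);
    if (cudaDeviceEnablePeerAccess(0, 0) != cudaSuccess) return 1;
    cudaMalloc(&b, bytes);
    cudaMemset(b, 0, bytes);

    cudaMemcpyPeer(b, 1, a, 0, bytes);    // P2P copy, GPU 0 -> GPU 1

    std::vector<unsigned char> h(bytes);
    cudaMemcpy(h.data(), b, bytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (size_t i = 0; i < bytes; ++i)
        if (h[i] != 0xAB) { ok = false; break; }
    printf("P2P data check: %s\n", ok ? "PASS" : "FAIL");
    return ok ? 0 : 2;
}
```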

Customers that desire support with this topic have a few options, at least.

  1. You can use community support (i.e. post questions on this forum.) AFAIK, NVIDIA provides no guarantees that any or all questions posted on these forums will be resolved.

  2. You can file a bug. If the problem cannot be reproduced conveniently by NVIDIA QA handling the bug, forward progress may be difficult or ineffective. (Just setting expectations.) There is no guarantee that all bugs will be resolved.

  3. For systems purchased from a reputable vendor, you can pursue support with the vendor. Major system OEMs like Dell, HPE, Supermicro, and many others, have their own dedicated support paths directly to NVIDIA. If you added GPUs to the system yourself, I personally would not expect much support from the system vendor on GPU usage. NVIDIA generally recommends that systems intended for enterprise usage be configured (i.e. assembled) by knowledgeable system OEMs or system integrators, not by end-users. There are a variety of reasons for this. If you put a GPU in a system in a configuration that was not qualified by the OEM, the OEM should probably refuse to support that, and NVIDIA would definitely refuse to support any requests from the system OEM related to such a configuration. Furthermore, a system OEM or integrator is under no obligation to provide any support for any hardware you did not purchase from them.

  4. You can purchase an entitlement to NVIDIA enterprise support. The most common/typical way to do this for GPU related work is to purchase a NVIDIA AI Enterprise license, which comes with a number of benefits beyond just support. The most common way to purchase a NVIDIA AI Enterprise License/Entitlement is to work through the system vendor that supplied the system. Again, if you purchased the system without GPUs, this probably won’t work/is not available. DGX systems generally come with support, and H100 PCIE purchases via a reputable system OEM/integrator at the current time come with an included NVIDIA AI Enterprise License.

A number of these support statements pertain to enterprise products (only). For GeForce GPUs, the support scenario(s) are more limited. GeForce GPUs are not intended for enterprise usage.


Ok (well, not ok, but ok for the purposes of this paragraph), so how about a simple and common case then? Plain vanilla consumer PC motherboard from a respectable/popular manufacturer, single PCIe root complex, no bridges, single CPU socket. Recent Linux distribution with an up-to-date CUDA distribution and the driver that comes with it. No special clocking of anything. Each one of the two cards working individually. Identical cards - same vendor, same model, same everything. Is it still complex then?

But what if someone wants to choose, or recommend, a card which will have peer access when bought in a pair? i.e. which will have peer access on systems without weird gotchas?

Hmm. That’s an interesting idea. Do you believe they might tell me “Oh, P2P is not supported for your GPU, don’t bother?”

Yes. Rather than talk about things being complex or not, if it seems more understandable to say “I won’t be able to do that,” then let’s use that formulation instead. Complexity is in the eye of the beholder.

I refer you to my statements about shopping.

I don’t know what they would do. I do know that there really aren’t restrictions on anyone using the bug filing portal. It is one of the support paths that NVIDIA uses to tackle certain issues.