I have one setup with two V100 SXM blades, and another setup with two 980 Ti blades.
I check with cudaDeviceCanAccessPeer whether peer access is supported, and if it is, I use cudaDeviceEnablePeerAccess to enable it.

I then use cudaMemcpyPeerAsync to copy data between the devices.
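For reference, the check/enable/copy sequence I am describing looks roughly like this. This is a minimal sketch: the device IDs (0 and 1), the buffer size, and the omitted error checking are all placeholders, not my exact code.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    // Check peer capability in both directions
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    if (can01 && can10) {
        // Peer access is enabled per direction, from each device's context
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }

    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary example size
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Copy from src on device 0 to dst on device 1
    cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, stream);
    cudaStreamSynchronize(stream);

    printf("peer copy done (can01=%d can10=%d)\n", can01, can10);
    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    cudaStreamDestroy(stream);
    return 0;
}
```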

There are three types of device-to-device copies:

  1. PCIe through the host - slowest
  2. PCIe via a PCIe switch (AKA RDMA)
  3. NVLink - fastest

My question is: how can I be sure which copy method is used in each setup?
If both RDMA and NVLink are supported, which one will be used?
I of course want the fastest method available to the hardware.


If two GPUs in your system are directly connected by NVLink, and you enable peer access between them, any subsequent cudaMemcpyPeerAsync calls between them will always use NVLink.

Transfer between two GPUs over PCIe is not called RDMA. That term refers to transfers between a GPU and a third-party device (GPUDirect RDMA).
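To see what the driver reports about the link between two devices, you can query cudaDeviceGetP2PAttribute; the performance rank is a relative ordering, where a lower value indicates a better link. You can also run `nvidia-smi topo -m`, whose matrix shows whether a GPU pair is connected via NV# (NVLink), PIX/PXB (PCIe switch), or PHB/SYS (through the host). A minimal sketch, assuming device IDs 0 and 1:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int access = 0, rank = 0;
    // Is direct peer access from device 0 to device 1 possible at all?
    cudaDeviceCanAccessPeer(&access, 0, 1);
    // Relative quality of the 0 -> 1 link; lower rank = better link
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, 0, 1);
    printf("peer access 0->1: %d, performance rank: %d\n", access, rank);
    return 0;
}
```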

And if I do not enable peer access, what kind of transfer will be used?
PCIe via a PCIe switch, or via the host CPU?


Via the host CPU - the runtime stages the copy through a host memory buffer.
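You can also confirm the path empirically by timing a large peer copy: on V100 SXM parts an NVLink copy should reach tens of GB/s, while a copy staged through the host will land at roughly PCIe speed or below. A rough sketch (error checking omitted, device IDs and the transfer size are assumptions):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB, large enough to amortize overhead
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpyPeer(dst, 1, src, 0, bytes);  // warm-up copy

    cudaEventRecord(start);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);  // timed copy
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("device 0 -> 1: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```

Run this once with peer access enabled and once without; the drop in measured bandwidth when the copy is staged through the host should be obvious.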