Why is cudaDeviceEnablePeerAccess not the default?

According to my profiling results, cudaMemcpy throughput improves dramatically after I call cudaDeviceEnablePeerAccess. I guess the reason is that the GPU copies data directly to the peer GPU instead of staging it through host memory.
So why is EnablePeerAccess not the default setting? Are there any side effects?

NVLink operates transparently within the existing CUDA model. Transfers between NVLink-connected endpoints are automatically routed through NVLink rather than PCIe. The cudaDeviceEnablePeerAccess() API call remains necessary to enable direct transfers (over either PCIe or NVLink) between GPUs. cudaDeviceCanAccessPeer() can be used to determine whether peer access is possible between any pair of GPUs.
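A minimal sketch of the pattern described above, assuming a system with at least two peer-capable GPUs (devices 0 and 1 here are placeholders; peer access is per-direction, so it is enabled from each device's context):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);  // can device 0 reach device 1?
    cudaDeviceCanAccessPeer(&can10, 1, 0);  // and the reverse direction?

    if (can01 && can10) {
        // Enable peer access in each direction, from each device's context.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);

        // With peer access enabled, cudaMemcpyPeer (and plain cudaMemcpy on
        // systems with unified addressing) transfers GPU-to-GPU directly
        // instead of staging through host memory.
        size_t bytes = 1 << 20;
        void *src = nullptr, *dst = nullptr;
        cudaSetDevice(0);
        cudaMalloc(&src, bytes);
        cudaSetDevice(1);
        cudaMalloc(&dst, bytes);
        cudaMemcpyPeer(dst, 1, src, 0, bytes);

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        printf("peer copy done\n");
    } else {
        printf("peer access not supported between devices 0 and 1\n");
    }
    return 0;
}
```

Note that cudaDeviceEnablePeerAccess() enables access in one direction only, which is why the sketch calls it once per device.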

The maximum number of peers a GPU can have varies between architectures. You might have more GPUs than a particular card can connect to directly, in which case you need to choose which connections are more important.
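To decide which connections matter most, it can help to first map out which pairs are peer-capable at all. A small sketch that enumerates every ordered device pair and reports its peer-access capability:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Check every ordered pair; peer capability is queried per direction.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d: %s\n", i, j,
                   can ? "peer capable" : "no peer access");
        }
    }
    return 0;
}
```

The output of this survey can guide which subset of pairs you actually pass to cudaDeviceEnablePeerAccess() when the per-GPU peer limit is lower than the number of GPUs in the system.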

The only side-effect is more speed :)
