My code attempts to enable peer access by GPU 0 to the other 7 GPUs in the system.
The first 4 pass cudaDeviceCanAccessPeer, but the last 3 fail.
This causes the code to run much slower than it does on a 4-GPU instance.
When profiled, I get the message:
==8804== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
I believe this is a DGX-1 Station, and I'm running Windows Server 2016 with CUDA SDK 9.1 and the latest driver as of mid-January.
The profiler shows that a kernel which takes 1 ms when not accessing UVM takes over 900 ms when I try to enable UVM.
The same kernel takes 2 ms on a 4-GPU (p3.8xlarge) instance (with UVM), while processing twice as much data.
Does anyone have any idea why 3 of 7 GPUs fail the peer access test?
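
For reference, here is a minimal sketch of the kind of check described above. It is not my exact code. It assumes GPU 0 is the device requesting access, and the helper peers_without_access is a hypothetical name used only for illustration. It queries cudaDeviceCanAccessPeer for every other device, enables peer access where the query succeeds, and lists the ordinals that fail.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): given one flag per device
// ordinal (1 = peer access possible, 0 = not), return the ordinals that the
// source device cannot reach. The entry for the source itself is ignored.
std::vector<int> peers_without_access(int src, const std::vector<int>& can_access) {
    std::vector<int> missing;
    for (int dev = 0; dev < static_cast<int>(can_access.size()); ++dev) {
        if (dev != src && can_access[dev] == 0) {
            missing.push_back(dev);
        }
    }
    return missing;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count < 1) {
        std::fprintf(stderr, "No usable CUDA devices: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int src = 0;  // assumption: GPU 0 is the device that accesses the others
    std::vector<int> can_access(count, 0);

    err = cudaSetDevice(src);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice(%d): %s\n", src, cudaGetErrorString(err));
        return 1;
    }

    for (int peer = 0; peer < count; ++peer) {
        if (peer == src) continue;

        int ok = 0;
        err = cudaDeviceCanAccessPeer(&ok, src, peer);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "CanAccessPeer(%d -> %d): %s\n",
                         src, peer, cudaGetErrorString(err));
            continue;
        }
        can_access[peer] = ok;

        if (ok) {
            // The flags argument is reserved and must be 0.
            err = cudaDeviceEnablePeerAccess(peer, 0);
            if (err == cudaErrorPeerAccessAlreadyEnabled) {
                cudaGetLastError();  // harmless; clear the error state
            } else if (err != cudaSuccess) {
                std::fprintf(stderr, "EnablePeerAccess(%d): %s\n",
                             peer, cudaGetErrorString(err));
            }
        }
        std::printf("GPU %d -> GPU %d: peer access %s\n",
                    src, peer, ok ? "available" : "NOT available");
    }

    for (int dev : peers_without_access(src, can_access)) {
        std::printf("No peer path from GPU %d to GPU %d\n", src, dev);
    }
    return 0;
}
```

Changing src to each ordinal in turn would print the full peer access matrix, which may help show whether the failures follow the machine's GPU topology.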