I have two K80s installed in a Supermicro SYS-7048GR-TR running Ubuntu 16.04 LTS. P2P copies work between the two GPUs on each card: GPUs 0 and 1 on the first card, and GPUs 2 and 3 on the second. However, when I try to establish P2P communication between GPUs on different cards, cudaDeviceCanAccessPeer reports that GPU 1 cannot access GPU 2. The test code is compiled with PGI Fortran 2017 (17.4) using -Mcuda=kepler.
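For anyone who wants to reproduce the check independently of the Fortran test code (which is not shown here), a minimal CUDA C++ sketch along these lines should print the same information. It is an illustration only: it assumes a machine with the CUDA toolkit and nvcc available for building, and the file name p2p_check.cu is hypothetical.

```cuda
// p2p_check.cu (hypothetical name): print the peer-access matrix
// that the CUDA runtime reports for every ordered pair of GPUs.
// Assumed build command: nvcc -o p2p_check p2p_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // List each device with its PCI location so the rows of the matrix
    // can be matched against the Bus-Id column of nvidia-smi.
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n", i, cudaGetErrorString(err));
            return 1;
        }
        std::printf("GPU %d: %s, PCI %04x:%02x:%02x\n",
                    i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }

    // Query peer access for every ordered pair (i -> j).
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            err = cudaDeviceCanAccessPeer(&can, i, j);
            if (err != cudaSuccess) {
                std::fprintf(stderr, "cudaDeviceCanAccessPeer(%d,%d): %s\n", i, j, cudaGetErrorString(err));
                return 1;
            }
            std::printf("GPU %d -> GPU %d: %s\n", i, j, can ? "peer access possible" : "no peer access");
        }
    }
    return 0;
}
```

If this reports the same result as the Fortran code, the compiler is probably not the cause. Running nvidia-smi topo -m alongside it shows how the driver sees the PCIe connections between the GPUs, which may help narrow things down.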
nvidia-smi reports the following:
myname@mybox:~$ nvidia-smi
Sun Sep 23 07:03:56 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   70C    P0    62W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   71C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   77C    P0    75W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   74C    P0    77W / 149W |      0MiB / 11439MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The Supermicro box is set up as a compute box, without a full installation of PGI Fortran or CUDA. The codes I run on it are compiled on my development box and transferred over. I have a copy of the PGI REDIST directory under /opt, with the LD_LIBRARY_PATH environment variable pointing to it. I installed the drivers using "sudo apt-get install nvidia-384 nvidia-modprobe".
My guess is that the problem is an out-of-date or incompatible compiler, driver, or BIOS, or some combination of those. The trouble is that I do not know where to start. Any prior experience with this problem would be much appreciated.