K80 p2p works between onboard GPUs but not between GPUs on different cards

I have two K80s installed in a Supermicro SYS-7048GR-TR running Ubuntu 16.04 LTS. P2P copies work between the two GPUs on each card: GPUs 0 and 1 on the first card, and GPUs 2 and 3 on the second card. However, when I try to establish P2P communication between GPUs on different cards, cudaDeviceCanAccessPeer reports that GPU 1 cannot access GPU 2. The test code is compiled with PGI Fortran 17.4 (2017) using -Mcuda=kepler.

nvidia-smi reports the following:

myname@mybox:~$ nvidia-smi
Sun Sep 23 07:03:56 2018
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| 0 Tesla K80 Off | 00000000:04:00.0 Off | 0 |
| N/A 70C P0 62W / 149W | 0MiB / 11439MiB | 0% Default |
| 1 Tesla K80 Off | 00000000:05:00.0 Off | 0 |
| N/A 71C P0 71W / 149W | 0MiB / 11439MiB | 0% Default |
| 2 Tesla K80 Off | 00000000:84:00.0 Off | 0 |
| N/A 77C P0 75W / 149W | 0MiB / 11439MiB | 0% Default |
| 3 Tesla K80 Off | 00000000:85:00.0 Off | 0 |
| N/A 74C P0 77W / 149W | 0MiB / 11439MiB | 93% Default |

| Processes: GPU Memory |
| GPU PID Type Process name Usage |
| No running processes found |
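
The bus IDs in this listing (04/05 for the first card, 84/85 for the second) suggest the two cards may sit in different parts of the PCIe tree. One way to check is to group the GPUs by the NUMA node that Linux reports for each PCI device. The following is a minimal Python sketch, assuming the machine exposes the standard sysfs `numa_node` attribute. The helper names `numa_node` and `group_by_node` are invented for this illustration, and the NUMA node is only a proxy for the PCIe root complex, not a definitive test.

```python
from pathlib import Path


def numa_node(bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node sysfs reports for a PCI device, or None if unreadable.

    Note: sysfs itself reports -1 when the platform does not provide the value.
    """
    path = Path(sysfs_root) / bus_id.lower() / "numa_node"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def group_by_node(bus_ids, sysfs_root="/sys/bus/pci/devices"):
    """Group PCI bus IDs by reported NUMA node (hypothetical helper)."""
    groups = {}
    for bus_id in bus_ids:
        groups.setdefault(numa_node(bus_id, sysfs_root), []).append(bus_id)
    return groups


if __name__ == "__main__":
    # Bus IDs from the nvidia-smi listing, written in sysfs form
    # (4-digit PCI domain instead of the 8-digit one nvidia-smi prints).
    gpus = ["0000:04:00.0", "0000:05:00.0", "0000:84:00.0", "0000:85:00.0"]
    for node, members in group_by_node(gpus).items():
        print(node, members)
```

If the two cards land in different groups, they are probably attached to different CPUs. `nvidia-smi topo -m` prints a comparable connectivity matrix straight from the driver.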

The Supermicro box is set up as a compute box, without a full installation of PGI Fortran or CUDA. The codes I run on it are compiled on my development box and transferred over. I have a copy of the PGI REDIST directory under /opt, with the LD_LIBRARY_PATH environment variable pointing to that directory. I installed the drivers using “sudo apt-get install nvidia-384 nvidia-modprobe”.

My guess is that the problem is an out-of-date or incompatible compiler, driver, or BIOS, or some combination of those. The trouble is that I do not know where to start. Any prior experience with this problem would be much appreciated.

Add the PPA repository and update the driver:
See if that helps.

Good suggestion, but it did not work. In fact, it made things worse.

I also contacted Supermicro support about this problem. They pointed out that the BIOS that came with the machine was old (2015) and suggested that an updated BIOS might help. So I downloaded the newest BIOS for the motherboard (from May of this year) and flashed it. The update seemed to work fine, but it did not solve the problem. So then I did as you suggested:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
The notes produced by this process said that the previous driver, 384, was now replaced by 390. So I modified my usual install command to the following:

sudo apt-get install nvidia-390 nvidia-prime

That ran with no errors. However, when I ran my P2P test code, it stopped when it could not initialize the first GPU on the second card. Using the 384 driver, I could run programs that used just two GPUs, either on card 1 (GPUs 0 and 1) or on card 2 (GPUs 2 and 3). With the 390 driver installed, I could not use the second card at all.

Even more puzzling, nvidia-smi produces the response:
“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest driver is installed and running.” So I rebooted the machine to see if that would help. After the reboot I hit the infinite-login-loop problem and could not get back to my desktop, so I pressed Ctrl-Alt-F3 to get to a text console and logged in there. From there, nvidia-smi was back and saw all four GPUs correctly. However, when I tried to run my p2ptest program, it reported that both cards had “…fallen off the bus.” Specifically, it reported:
[139,909746] NVRM: xid (PCI:0000:85:00) 79, GPU has fallen of the bus.
[139,910516] NVRM: xid (PCI:0000:84:00) 79, GPU has fallen of the bus.
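
When several GPUs drop out at once, it can help to pull the PCI address and Xid code out of each log line and match them against the bus IDs that nvidia-smi prints. The following is a minimal Python sketch, assuming log lines shaped like the two quoted above. The helper `parse_xid` and its regular expression are illustrative only, and real driver logs may differ in punctuation.

```python
import re

# Tolerant pattern: accepts "xid"/"Xid" and an optional colon after the
# closing parenthesis, since transcribed log lines vary.
XID_RE = re.compile(r"NVRM:\s*xid\s*\(PCI:([0-9a-f:.]+)\):?\s*(\d+)", re.IGNORECASE)


def parse_xid(line):
    """Return (pci_address, xid_code) from a kernel log line, or None."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    # The two lines as quoted in this thread.
    lines = [
        "[139,909746] NVRM: xid (PCI:0000:85:00) 79, GPU has fallen of the bus.",
        "[139,910516] NVRM: xid (PCI:0000:84:00) 79, GPU has fallen of the bus.",
    ]
    for line in lines:
        print(parse_xid(line))
```

Both addresses (84:00 and 85:00) correspond to GPUs 2 and 3 in the nvidia-smi listing earlier in the thread, that is, the two GPUs on the second card.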

So I thought I would try setting the cards to run in persistence mode using:
sudo nvidia-smi -pm 1

This returned:
“Unable to determine the device handler for GPU 0000:84:00”, which I assume is the second card. After this I purged the new driver and re-installed the 384 version. In this condition the p2p test was able to communicate between GPUs 0 and 1 on card 1, as before. A second run verified communication between GPUs 2 and 3 on the second card. As before, a run to test communication between GPU 1 on card 1 and GPU 2 on card 2 failed to establish communication.

These results tell me that the BIOS update did not influence P2P communication, but the update to the most recent GPU driver did change P2P behavior, albeit in a negative way. I am thinking that my next step is to rebuild my development machine with the latest PGI compilers and test the P2P test program, compiled with the latest compiler, against both vintages of drivers to see if one works. Because I invested several weeks in getting my development box up and running last year, I have been reluctant to do this. I also do not want to lose all ability to compile and use my applications on this box. So my plan is to pull its hard drive, replace it with a new one, and install everything from scratch to test the new compilers. That way I can always get back to the point where I can compile and run my applications on my development box. It will take a few days. If you or anyone else has additional ideas or things to try, please let me know.

Thanks for looking at this and for your suggestion.

But at least you can roll back to 384, yes?
I’ve just recently made a 16.04 installation and could only get CUDA 9.2 to work after installing 390, as 384 had all sorts of strange errors…

Yes, I am able to roll back to 384. So I have not lost any ground. What I think I will try next is updating my compiler. I have been compiling with PGI Fortran release 17.4 from last year. For this reason I was trying to keep everything the same vintage. That is not working, so I think I will update my PGI compiler to their latest and then see if either 384 or 390 will work with that. Thanks for your suggestions. If one of these changes solves the problem I will let you know.

After my last reply I did some more googling and found the answer to my problem in an old post (2016) in this forum, titled “MultiGPU P2P Access Weird result”.

In this post the user runs the CUDA simpleP2P example program on a machine that has four K80s installed. The results show that the GPUs on the first two K80 cards can talk to each other and the GPUs on the second two cards can talk to each other, but the GPUs on cards 1 and 2 cannot talk to the GPUs on cards 3 and 4.

The answer by user “njuffa” was that p2p communication requires that the GPUs are on the same PCIe root complex. Each CPU has its own root complex. He then guesses that the user’s machine is a dual CPU machine in which CPU 1 interacts with PCIe sockets 1 and 2 and CPU 2 interacts with sockets 3 and 4.

My machine is a dual CPU machine with four PCIe sockets, and I happen to have the K80s installed in sockets 1 and 3, thinking that because these things run really hot, I would spread them out. The solution is to move card 2 to socket 2. This is really not the answer I was hoping for, because it means that when I install the second pair of GPUs (which I already own) there will be a communication bottleneck between the two sets of GPUs. The upside is that I now think I understand the problem: it is a limitation of the way the PCIe is set up and not a software problem.
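
To make the root-complex rule concrete, here is a small Python sketch that lists which GPU pairs would be expected to have peer access, given a mapping from GPU index to the CPU it hangs off. The mapping is an assumption taken from the description in this thread (sockets 1 and 2 on CPU 1, sockets 3 and 4 on CPU 2), and `peer_pairs` is a hypothetical helper. The authoritative answer still comes from cudaDeviceCanAccessPeer on the actual hardware.

```python
from itertools import combinations


def peer_pairs(gpu_to_root):
    """Return GPU index pairs that share a root complex (simplified model)."""
    return [
        (a, b)
        for a, b in combinations(sorted(gpu_to_root), 2)
        if gpu_to_root[a] == gpu_to_root[b]
    ]


if __name__ == "__main__":
    # Assumed layout with the cards in sockets 1 and 3: one card per CPU.
    spread_out = {0: "cpu1", 1: "cpu1", 2: "cpu2", 3: "cpu2"}
    print(peer_pairs(spread_out))  # [(0, 1), (2, 3)]

    # Assumed layout after moving card 2 to socket 2: both cards on CPU 1.
    same_cpu = {0: "cpu1", 1: "cpu1", 2: "cpu1", 3: "cpu1"}
    print(peer_pairs(same_cpu))  # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

The first layout reproduces the behavior reported above: peer access within each card, and none between GPU 1 and GPU 2.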

I hope this information helps others down the road.

Thanks much!