Strange behavior after upgrade from 17.4 to 18.7

I am trying to upgrade from the 17.4 release running under Ubuntu 14.04 to the 18.7 release running under a fresh install of Ubuntu 16.04 LTS. The install process and license tool seem to run fine. A test compile of a code that ran under my old setup completed without error. However, when I ran the code it returned a large negative number for the number of gpu devices available (should have been 3). Also, when I tried to run nvidia-smi, it returned “command not found”. It seems like the installation did not complete. Anyone have any ideas how to proceed?
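For reference, the device query in my code amounts to something like the following (a minimal CUDA C sketch for discussion; the real code is built with the PGI compilers, so the names here are only illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int ndev = -999;  // placeholder; only meaningful if the call succeeds
    cudaError_t err = cudaGetDeviceCount(&ndev);
    if (err != cudaSuccess) {
        // With a broken or missing driver the call fails and the count is not valid.
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("found %d gpu device(s)\n", ndev);  // expect 3 on this box
    return 0;
}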

Hi John,

This sounds like a driver issue.

I believe Ubuntu will install the nouveau drivers by default, which can’t be used for compute. Instead, please make sure the NVIDIA driver is installed (Official Drivers | NVIDIA).

-Mat

The driver did the trick. However, I had to go about the installation in a different way. The link you posted took me to Nvidia’s download page. I selected the driver for the C2075s in my development box and for Ubuntu 16.04 LTS, which led to a download of a .deb file. After installing that per their instructions, the machine rebooted into an infinite login loop. After Ctrl-Alt-F3 to a shell, I ran nvidia-smi, and it returned “unable to connect to drivers, make sure you have the latest drivers installed”. So I purged that installation with apt-get purge nvidia* and installed from Ubuntu’s repository with apt-get install nvidia-390.

This approach worked on my development box: nvidia-smi sees the cuda devices, my codes compile and run without error, and there is no login-loop issue. So problem solved there. However, when I did the same thing on my Supermicro compute box, which has two K80s installed, I got the original problem: when the code checks the number of cuda devices it gets a large negative number. Both boxes are running Ubuntu 16.04 LTS, and the Supermicro has a fresh BIOS from May of this year.

I have had the Supermicro for a year with one K80 installed, and it ran fine under Ubuntu 14.04 with code compiled on my development box using PGI release 17.4. When I added the second K80 this summer I found that it would run jobs on either K80, but when I tried to run one job across both, I could not transfer data between the cards. The two K80s are on different root complexes of the PCIe bus; my code uses p2p for gpus on the same root complex and cudaMemcpy for gpus for which p2p is not available. The p2p transfers between the two gpus on the same card worked, but the cudaMemcpy transfers between gpus on different root complexes failed. Those transfers copy data from one gpu to the host and then from the host to the other gpu, and it is the first copy, gpu to host, that fails. The same code runs fine on the development box.

I figured I would try updating everything to the latest versions, so I have upgraded both machines to 16.04, updated the compiler to 18.7, and updated the BIOS on the Supermicro to the latest available. Now I cannot get the codes to run even on the individual cards. This is a gpu server box designed to support four gpus. At this point, I am out of ideas.
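In case it helps, the staged route that fails boils down to something like this (a rough CUDA C sketch of that path, not my actual code; the device numbers and sizes are only illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Report any CUDA runtime error for the step named in "what".
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
}

int main()
{
    const size_t bytes = (1 << 20) * sizeof(double);
    double *d_src, *d_dst, *h_buf;

    check(cudaMallocHost(&h_buf, bytes), "cudaMallocHost");

    check(cudaSetDevice(0), "cudaSetDevice(0)");
    check(cudaMalloc(&d_src, bytes), "cudaMalloc on device 0");

    check(cudaSetDevice(1), "cudaSetDevice(1)");
    check(cudaMalloc(&d_dst, bytes), "cudaMalloc on device 1");

    // Stage 1: device 0 -> host. This is the copy that fails on the Supermicro.
    check(cudaSetDevice(0), "cudaSetDevice(0)");
    check(cudaMemcpy(h_buf, d_src, bytes, cudaMemcpyDeviceToHost),
          "cudaMemcpy device 0 -> host");

    // Stage 2: host -> the other device.
    check(cudaSetDevice(1), "cudaSetDevice(1)");
    check(cudaMemcpy(d_dst, h_buf, bytes, cudaMemcpyHostToDevice),
          "cudaMemcpy host -> device 1");

    return 0;
}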

Hi John,

I’ve asked my IT folks to take a look at your post to see if they have any ideas.

-Mat

I feel like I should add a few additional points of clarification. The problem I was having under Ubuntu 14.04 / PGI release 17.4 was that memory copies, and even direct assignment statements, would not transfer data between a gpu and the host. I use that approach to transfer data between gpus for which p2p is not available because they sit on different root complexes of the PCIe bus; the Supermicro box has two cpus, each with its own root complex that controls two slots. My recent updates and upgrades were an attempt to get around this problem, and they have led to other problems rather than fixing the original one. My guess is that there is some problem with the driver install on the Supermicro box. I have tried several install procedures and different drivers (375, 384, and 390), and each produces its own set of problems on the Supermicro box.

My development box worked well under both Ubuntu 14.04 / release 17.4 and 16.04 / release 18.7. It is based on an Asus motherboard with no onboard graphics, a single i7 cpu, and three PCIe slots. One slot holds a low-end Quadro card used for graphics; the other two slots have C2075s installed. Using the 17.4 release I would compile for the development box with Mcuda=cc20. That option is not available in the 18.7 release, but ccall seems to work fine. I use Mcuda=cc3.x to compile for the K80s on the Supermicro box. Because my development box has only one cpu and one root complex, I cannot test data transfers between gpus on different complexes there. Instead I force transfers to be either all p2p or all staged through the host (gpu to host, then host to the other gpu) so I can test both routes.
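For completeness, the choice between the two routes comes from the usual peer-access query, roughly like this (again a CUDA C sketch for discussion, with illustrative device numbers):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Ask whether devices 0 and 1 can reach each other directly.
    // On the Supermicro this comes back 0 for gpus on different root complexes,
    // so those pairs fall back to the staged copy through the host.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    if (can01 && can10) {
        // Enable p2p in both directions; transfers then use cudaMemcpyPeer.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        printf("using p2p between devices 0 and 1\n");
    } else {
        printf("no p2p between devices 0 and 1; staging through the host\n");
    }
    return 0;
}

On the development box this query is somewhat moot since there is only one root complex, which is why I force one route or the other when testing there.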

Hi John,

Here’s the response from our IT folks:

It would be helpful to know what nvidia-smi is reporting on his production box. I can’t do much with the negative number he mentions; that’s just a function of whatever his code is doing.

The NVIDIA drivers contain kernel modules that are tied to a specific kernel version. There is a mechanism (Dynamic Kernel Module Support) that rebuilds the kernel modules when you upgrade kernel versions, but this doesn’t always work when jumping from something like Ubuntu 14.04 to 16.04, and whether it’s even enabled depends on how the drivers were originally installed.

There are a lot of different ways to install NVIDIA drivers. I suspect he has jumbled up pieces from multiple installations. Installing drivers from different sources and via different mechanisms is the quickest way to end up with a broken environment. So step 1 is to remove everything NVIDIA.

I wouldn’t use the nvidia-390 package from Ubuntu. Since it was installed via the package manager, it should be uninstalled via the package manager. There may be some bad dependency issues there that lead to other things being removed, which is another reason to avoid the package-manager version.

After that, check to see if there is a /usr/bin/nvidia-uninstall script. If so, run it.

Then:
sudo apt-get remove --purge nvidia-*

After a reboot, everything NVIDIA should be removed. Check via:
lsmod | grep nvidia
lsmod | grep nouveau

If there are no graphics drivers running, nothing should be returned. If something is returned, then perhaps something like a running GUI is preventing the drivers from being uninstalled. He should not only switch to the shell, but also run “systemctl isolate multi-user.target” to disable the display manager before running all of these purge commands. He can then run “systemctl start graphical.target” to start the GUI again.

If nvidia modules still remain present, there is further troubleshooting that can be done, but let’s cross that bridge when we come to it.

Once the drivers are purged, he should install the NVIDIA drivers from here and follow the instructions under “additional information”:
Tesla Driver for Ubuntu 16.04 | 396.44 | Linux 64-bit Ubuntu 16.04 | NVIDIA

His P2P problems are beyond the scope of anything I know about, but will perhaps be addressed by the driver upgrade.

-Mat