HELP! Unable to get K20Xm TESLA drivers to work with Ubuntu 16.04 LTS

Dear all,

I have two Dell workstations (a Dell T7400 and a Dell T7500), each with 32GB RAM and one nVidia TESLA K20Xm GPU. The workstations are running Ubuntu 16.04 LTS.

I want to work with Tensorflow and CUDA 10.1 or 10.2 (preferred). However, despite trying multiple times in different ways, I’m unable to get the nVidia GPU and CUDA drivers working.

CUDA drivers for Dell T7500 with nVidia TESLA K20Xm GPU
The first machine, a T7500, had the K20Xm GPU card as well as an old Quadro FX1800 graphics card (working well at high res). I first went through the pre-installation checklist for the -440 release drivers to make sure I had the correct starting point. I downloaded the run file (selecting the options for OS etc. on the nVidia website) and ran it from a terminal session with the lightdm service stopped. It initially warned that the “pre-installation script failed”, but having read other forums I continued past this. It then said it had detected the old Quadro FX1800 GPU but would ignore it (I felt this was promising, as I need to use the TESLA K20Xm). I chose yes to DKMS module installation, and after the restart I was unable to log in (a login loop, logging me back out immediately). Most forums report that the login loop is corrected either by removing the nVidia drivers or by checking permissions on .Xauthority and .ICEauthority (both had correct permissions). Removing the drivers didn’t fix the problem; I had to remove all of the display manager and lightdm packages and re-install them, but then I was left without the GPU drivers. I tried the run file installation again, this time after blacklisting Nouveau. The GPU still didn’t work (nvidia-smi didn’t run correctly, despite the path being set) and my display settings were very poor.
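For reference, blacklisting Nouveau usually comes down to something like the sketch below. The file name blacklist-nouveau.conf and the helper name nouveau_blacklist_conf are assumptions for illustration (they are a common convention, not something taken from the installer output), and the commands that need root are left as comments.

```shell
#!/bin/sh
# Sketch only: prints the modprobe configuration commonly used to
# blacklist Nouveau. This function writes nothing to disk by itself.
nouveau_blacklist_conf() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0'
}

# Typical use (assumed file name; needs root, so left commented out):
#   nouveau_blacklist_conf | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
#   sudo update-initramfs -u    # rebuild the initramfs so the change applies at boot
#   sudo reboot
#   lsmod | grep nouveau        # prints nothing once Nouveau is no longer loaded

nouveau_blacklist_conf
```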

Thinking that perhaps the nVidia Quadro FX graphics card was compounding the problem, I replaced it with an ATI Radeon graphics card ("[AMD/ATI] RV710/M92 [Mobility Radeon HD 4530/4570/545v]"). I installed the Radeon drivers and had it running nicely (with high res), and then re-attempted the -440 driver installation via the run file. When this failed, I removed everything and tried adding a PPA and installing with apt-get install nvidia-440 plus the CUDA packages; again, this did not work. I also tried “ubuntu-drivers autoinstall”, which didn’t work either. This T7500 workstation is now left with broken packages, a non-installed GPU and low-res graphics. The menu bars and icons also appear to be missing from LightDM.

CUDA drivers for Dell T7400 with nVidia TESLA K20Xm GPU
The next machine (which also has a TESLA K20Xm GPU) is a Dell T7400 with an old (but working fine) nVidia Quadro NVS300 (“NVIDIA Corporation GT218 [NVS 300] (rev a2)”). I tried the CUDA-drivers installation run file, which I understand has options to install drivers, cuda drivers, cuda samples, etc. Again, I first shut down the lightdm service and blacklisted nouveau. After installation, I experienced similar problems: the graphics drivers were disrupted. I then tried removing the K20 GPU drivers with “nvidia-uninstall” and trying again via the PPA, first with the 418 and then with the 430 drivers; again, no luck. This workstation also now has the same issue with the login loop.

So my questions: how does one correctly install the nVidia GPU drivers alongside a graphics card, without disrupting the graphics card drivers (understandably, the GPU has no graphics port)? Also, how does DKMS work with the nVidia drivers, and should I be using it? (All I understand is that it maintains kernel module builds, which is useful for future updates and re-builds of the kernel.)

These two machines are part of a cluster which is running Infiniband without a switch. It was tricky to get this fabric up and running with the Mellanox drivers, so I don’t wish to re-install Ubuntu 16.04 (a lot of work/configuration has also been done since, which I don’t want to have to re-do). Can someone please suggest what I can do to get the GPUs running on these machines?

This is driving me crazy, I feel it shouldn’t be this hard!
Help would be greatly appreciated. Many thanks in advance!

Jamie

First of all, forget about the Quadro FX and The NVS300, those are deprecated and need legacy drivers which are long past their EOL. They won’t work anymore with recent OS/driver versions, especially not in combo with the Tesla.
Second, Ubuntu 16.04 is pretty much EOL and not really useable when it comes to nvidia, the .run installer can’t really be used on that or only with extensive care.
So, on the System with the Radeon card, please uninstall the .run installer first using the --uninstall option, then remove the packages using
sudo apt remove nvidia*
and check if the radeon works again.

Only then, you can carefully use the .run installer using the options

--dkms --no-opengl-files

After reboot, the radeon should still be used and the Teslas accessible.
If not, please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).
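The order of operations above can be sketched roughly as follows. This is only an illustration under assumptions: the INSTALLER file name is a placeholder for whichever .run file was downloaded, the run helper and DRY_RUN switch are made up for this sketch so that nothing is executed by default, and stopping lightdm first is taken from the first post rather than from the steps above.

```shell
#!/bin/sh
# Sketch of the clean-up and reinstall order described above.
# INSTALLER is a hypothetical file name; point it at the .run file you downloaded.
INSTALLER="${INSTALLER:-./NVIDIA-Linux-x86_64-440.xx.run}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Phase 1: remove every trace of earlier attempts, then reboot and
# confirm the Radeon desktop works before going any further.
cleanup_old_driver() {
    run sudo service lightdm stop           # stop X before touching the driver
    run sudo sh "$INSTALLER" --uninstall    # undo a previous .run installation
    run sudo apt remove 'nvidia*'           # quoted so the shell does not expand the glob
}

# Phase 2: install the kernel module only, leaving the OpenGL files
# used by the Radeon untouched.
install_tesla_driver() {
    run sudo service lightdm stop
    run sudo sh "$INSTALLER" --dkms --no-opengl-files
}

cleanup_old_driver
install_tesla_driver
```

With the default DRY_RUN=1 the script only prints the commands, which makes it easy to review them before running each phase by hand with a reboot in between.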

Hi generix, thank you for your suggestions.

Update:
I’ve fixed the login loop both machines experienced. In the T7400, I replaced the nVidia Quadro NVS 300 with an AMD/ATI Radeon 5450. Hopefully this will avoid conflicts by removing the need for two sets of nVidia drivers to load, or for one driver to drive both cards (which is unlikely to work).

> First of all, forget about the Quadro FX and The NVS300, those are deprecated and need legacy drivers which are long past their EOL.

I understand you’re right about them being EOL, but I had the nvidia drivers loaded and working for this card at high res. I did replace the old nVidia FX1800 card in the T7500 with an AMD/ATI Radeon one (as in my original post). As mentioned above, I’ve now done the same with the T7400, replacing the old NVS300 with an AMD/ATI Radeon.

> They won’t work anymore with recent OS/driver versions, especially not in combo with the Tesla.

Yes, I agree the FX1800/NVS300 both seem to be causing problems with the K20 Tesla. I was thinking it would be possible to set the PCI BusID in the “Device” section of /etc/X11/xorg.conf, i.e. first checking the PCI bus ID of the VGA graphics adapter (using lspci | grep VGA) and adding it with nvidia as the driver, while leaving the TESLA GPU out of xorg.conf. I tried this but it didn’t work.
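One detail worth keeping in mind with that approach: lspci prints the slot in hexadecimal (bus:device.function), while the BusID entry in xorg.conf is normally written in decimal as PCI:bus:device:function. The helper below is a small sketch of that conversion; the function name to_xorg_busid and the example slot are made up for illustration, and this is not meant to suggest it was the reason my attempt failed.

```shell
#!/bin/sh
# Convert an lspci slot such as "0a:00.0" (hexadecimal) into the
# "PCI:bus:device:function" form (decimal) used for BusID in xorg.conf.
to_xorg_busid() {
    busid_slot="$1"
    busid_bus="${busid_slot%%:*}"     # part before the first ':'  -> 0a
    busid_rest="${busid_slot#*:}"     # part after the first ':'   -> 00.0
    busid_dev="${busid_rest%%.*}"     # part before the '.'        -> 00
    busid_fn="${busid_rest#*.}"       # part after the '.'         -> 0
    printf 'PCI:%d:%d:%d\n' "0x$busid_bus" "0x$busid_dev" "0x$busid_fn"
}

# Example with a made-up slot:
to_xorg_busid 0a:00.0    # prints PCI:10:0:0

# On a real machine the slot would come from lspci, for example:
#   lspci | grep VGA | while read -r slot description; do to_xorg_busid "$slot"; done
```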

> Second, Ubuntu 16.04 is pretty much EOL and not really useable when it comes to nvidia, the .run installer can’t really be used on that or only with extensive care.

I agree, 18.04 and the newly released 20.04 work fine on these machines, but I am getting >8-9 GBps of bandwidth using older Mellanox cards. I’m not sure whether the mlx4 modules and subnet manager are available for Ubuntu >= 18.04, and these are critical to the performance of the cluster.

As for the run installer, it confuses me that the documentation for the run file shows both Ubuntu 16.04 LTS and 18.04 LTS as supported…

https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html

I also tried all of the pre-install checks which I’ll attach in my next post.

> So, on the System with the Radeon card, please uninstall the .run installer first using the --uninstall option, then remove the packages using sudo apt remove nvidia*, and check if the radeon works again.

Thanks, yes, I have purged the nvidia* driver packages and also removed the modified xorg.conf (this config file was also causing problems with lightdm starting).

> Only then, you can carefully use the .run installer using the options

> --dkms --no-opengl-files

Thanks, will try this. I had seen the --no-opengl-files option but overlooked it, so I will try that, thank you.

> After reboot, the radeon should still be used and the Teslas accessible.
> If not, please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

Thanks again, will do. I’m quite apprehensive about going through the whole broken ubuntu-desktop and login loop again, but I’ll try it again with the --dkms and --no-opengl-files parameters.

Attached is the file detailing the pre-install checks…

nVidia-K20-Preinstall-check.txt (3.6 KB)

Update:

After a bit of anxiety about having to fix Ubuntu-Desktop and re-install all the related packages again (I had already done that once to get it back up and running with just the Radeon driver), I restarted the install.

However, this time I used the CUDA 10.2 installer (which has the driver, CUDA, samples etc.). It failed last time, but when I added the --no-opengl-files option (which you suggested), it worked fine. I can’t thank you enough!

BTW - the newest nVidia GPU installation manual (June-2020) does actually mention this, but I overlooked it with my focus being blacklisting nouveau.
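For anyone finding this later, a quick sanity check after an install like this might look like the sketch below. The function names report_tool and check_gpu_setup are made up for illustration, and each tool is only called if it is actually present, so the script is harmless on a machine without the driver.

```shell
#!/bin/sh
# report_tool NAME: say whether a command is available on PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_gpu_setup() {
    report_tool nvidia-smi
    report_tool dkms

    # List the GPUs the driver can see (the Tesla should show up here).
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi could not talk to the driver"
    fi

    # Show which modules DKMS has built for which kernels.
    if command -v dkms >/dev/null 2>&1; then
        dkms status
    fi

    # nvidia should be loaded; nouveau should not be.
    if command -v lsmod >/dev/null 2>&1; then
        lsmod | grep -E '^(nvidia|nouveau)' || echo "no nvidia/nouveau kernel module loaded"
    fi
}

check_gpu_setup
```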