I have two Dell workstations (a T7400 and a T7500), each with 32GB RAM and one nVidia TESLA K20Xm GPU. Both workstations are running Ubuntu 16.04 LTS.
I want to work with TensorFlow and CUDA 10.1 or 10.2 (preferred). However, despite multiple attempts using different methods, I’m unable to get the nVidia GPU and CUDA drivers working.
CUDA drivers for Dell T7500 with nVidia TESLA K20Xm GPU
The first machine, a T7500, had the K20Xm GPU card as well as an old Quadro FX1800 graphics card (working well at high res). I first went through the pre-installation checklist for the -440 release drivers to make sure I had the correct starting point. I downloaded the run file (selecting the options for OS etc. on the nVidia website) and ran it from a terminal session with the lightdm service stopped. It initially warned that the “pre-installation script failed”, but, having read on other forums that this can be ignored, I continued past it. It then said it had detected the old Quadro FX1800 GPU but would ignore it (I found this promising, as the TESLA K20Xm is the card I need to use). I chose yes to DKMS module installation.
After the restart I was unable to log in (a login loop that logged me straight back out). Most forums say the login loop is fixed either by removing the nVidia drivers or by checking the permissions on .Xauthority and .ICEauthority (both had correct permissions). Removing the drivers didn’t fix the problem; I had to remove all of the display manager and lightdm packages and re-install them, which left me without the GPU drivers. I then tried the run file installation again, this time after blacklisting Nouveau. The GPU still didn’t work (nvidia-smi didn’t run correctly, despite the path being set) and my display settings were very poor.
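For anyone trying to reproduce this state, the two checks mentioned above (the X authority files, and whether Nouveau is really out of the way) can be sketched roughly as follows. This is only a sketch: the helper names owned_by and module_loaded are made up for illustration, it checks file ownership rather than the full permission bits (my assumption about what the forum advice usually comes down to), and it assumes GNU stat and a readable /proc/modules as on a stock Ubuntu install.

```shell
#!/bin/sh
# Hypothetical helper: report whether FILE exists and is owned by numeric UID.
owned_by() {
    file="$1"
    uid="$2"
    if [ ! -e "$file" ]; then
        echo "missing: $file"
        return 2
    fi
    owner=$(stat -c %u "$file")   # GNU stat: numeric owner id
    if [ "$owner" = "$uid" ]; then
        echo "ok: $file"
    else
        echo "wrong owner ($owner): $file"
        return 1
    fi
}

# Hypothetical helper: succeed if the named kernel module is currently loaded.
module_loaded() {
    grep -q "^$1 " /proc/modules 2>/dev/null
}

# The two files the login-loop advice refers to.
for f in "$HOME/.Xauthority" "$HOME/.ICEauthority"; do
    owned_by "$f" "$(id -u)" || true
done

# Nouveau should be absent and nvidia present once the driver is working.
for m in nouveau nvidia; do
    if module_loaded "$m"; then
        echo "$m: loaded"
    else
        echo "$m: not loaded"
    fi
done
```

This should be run as the normal desktop user (not with sudo), so that id -u matches the account that is stuck in the login loop.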
Thinking that the nVidia Quadro FX graphics card might be compounding the problem, I replaced it with an ATI Radeon graphics card ("[AMD/ATI] RV710/M92 [Mobility Radeon HD 4530/4570/545v]"). I installed the Radeon drivers and had it running nicely (at high res), and then re-attempted the -440 driver installation via the run file. When this failed, I removed everything and tried adding a PPA and installing with apt-get install nvidia-440 plus the CUDA packages; again, this did not work. I also tried “ubuntu-drivers autoinstall”, which didn’t work either. This T7500 workstation is now left with broken packages, no working GPU driver and low-res graphics. The menu bars and icons also appear to be missing from LightDM.
CUDA drivers for Dell T7400 with nVidia TESLA K20Xm GPU
The next machine (which also has a TESLA K20Xm GPU) is a Dell T7400 with an old (but working fine) nVidia Quadro NVS300 (“NVIDIA Corporation GT218 [NVS 300] (rev a2)”). I tried the CUDA-drivers installation run file, which I understand has options to install drivers, CUDA drivers, CUDA samples, etc. Again, I first shut down the lightdm service and blacklisted Nouveau. After installation, I experienced similar problems: the graphics drivers were disrupted. I then removed the K20 GPU drivers with “nvidia-uninstall” and tried again using the PPA, first with the 418 and then the 430 drivers; again, no luck. This workstation also now has the same login loop issue.
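For reference, blacklisting Nouveau is usually done with a small modprobe fragment like the one below. This is a sketch rather than a record of exactly what is on these machines: the file name blacklist-nouveau.conf and the CONF_DIR scratch variable are assumptions, and the fragment is written to a scratch directory so that it can be inspected before being copied to /etc/modprobe.d/ as root.

```shell
#!/bin/sh
# Write the fragment to a scratch directory first (CONF_DIR is a hypothetical
# override); nothing under /etc is touched by this sketch.
conf_dir="${CONF_DIR:-$(mktemp -d)}"

cat > "$conf_dir/blacklist-nouveau.conf" <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF

echo "wrote $conf_dir/blacklist-nouveau.conf"
```

After copying the file into /etc/modprobe.d/, running sudo update-initramfs -u and rebooting is needed so that the early boot image stops loading Nouveau as well.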
So my questions: i) How does one correctly install the nVidia GPU drivers alongside a separate graphics card, without disrupting the graphics card drivers (understandably, the GPU has no graphics port)? ii) How does DKMS work with the nVidia drivers, and should I be using it? (All I understand is that it rebuilds kernel modules when the kernel changes, which is useful for future kernel updates and rebuilds.)
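To make the DKMS part of the question concrete: as far as I understand it, DKMS keeps the module source registered and rebuilds it for each installed kernel, so its state can be inspected with something like the sketch below. run_if_present is a made-up helper; dkms status, uname -r and lspci -nnk are the standard commands, and the output will differ per machine.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if it is installed, otherwise say so.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "not installed: $1"
        return 127
    fi
}

# Modules registered with DKMS and the kernels they have been built for.
run_if_present dkms status || true

# The running kernel; a DKMS-built nvidia module has to exist for this version.
uname -r

# Which kernel driver (if any) is bound to each display-class PCI device.
# The display card should show as a VGA controller; a compute-only card
# typically shows as a "3D controller".
run_if_present lspci -nnk | grep -i -A3 -E 'vga|3d controller' || true
```

If the kernel version printed by uname -r does not appear next to the nvidia entry in the dkms status output, the module was not built for the kernel that is actually running.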
These two machines are part of a cluster running Infiniband without a switch. It was tricky to get this fabric up and running with the Mellanox drivers, so I don’t want to re-install Ubuntu 16.04 (a lot of work and configuration has also been done since, which I don’t want to have to redo). Can someone please suggest what I can do to get the GPUs running on these machines?
This is driving me crazy, I feel it shouldn’t be this hard!
Help would be greatly appreciated. Many thanks in advance!