Keeping the Driver Bundled With the Nvidia CUDA Toolkit Up to Date on Ubuntu

Hi everyone,

I’ll start with some background to explain why I’m asking this particular question. I have an HPC compute server, and its entire purpose is to run research calculations with CUDA acceleration. I updated the server to Ubuntu 14.04 (it was previously running Ubuntu 12.04) and ran into a major issue with the drivers. If it helps, this server does not have a desktop installed, and the video output does not use the same card as the ones used for calculations.

I was using the .deb install, which sets up a CUDA PPA, together with drivers provided by the Ubuntu repositories. Through this PPA I was able to install multiple different versions of the CUDA toolkit and numerous versions of the Nvidia drivers. In all, I tried dozens of combinations of Nvidia driver + CUDA version. In every case I was able to install CUDA and the graphics driver and compile the software I use (Gromacs), but I hit catastrophic errors when it came time to actually run the software. In numerous cases the failure actually locked up the system and required a reboot. I also tried the .run file installation combined with drivers provided by the Ubuntu repositories, with similar results. Now, there are many, many possible points of failure, so I wasn’t pointing any fingers at any one thing.

Fortunately, I stumbled across a solution that seemed to work: the combination of the .run file installation and the driver bundled within that package. It worked perfectly. Which brings me to my current problem.

The CUDA toolkit installs a driver that ONLY works for the kernel version that is current when you install it. Ubuntu sometimes updates its kernel once a week. Right now, my workaround is to modify the grub settings (example below) to always boot the kernel version that was installed when I finally got CUDA working. However, in the long run this is bad practice, as the kernel updates are primarily for security. I’m wondering if anyone knows a relatively simple way to update this version of the driver for a given kernel. I’d prefer to stick with ONE version of the driver, which goes along with keeping my research results as reproducible as possible. Is there perhaps something in the .run file installation that would install ONLY the driver and not the rest of CUDA? I’m a bit afraid of just trying the .run file again myself, as it took me days to randomly stumble across a combination that actually worked. I’d really be unhappy if I ended up breaking things all over again by stumbling around in the installer with no idea what I’m doing.
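For reference, the pinning itself is just a grub tweak along these lines (the menu entry text and the kernel version are placeholders for whatever is on your own system):

    # /etc/default/grub
    GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 3.13.0-xx-generic"

    # then regenerate the grub configuration
    sudo update-grub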

Thanks for any suggestions.

There are two canonical methods for driver installation: runfile and package manager. You don’t want to mix these methods on a given machine. I mention this because you start out talking about setting things up using the .deb method and then transition to talking about runfiles.

If you use a runfile installer (for both CUDA and the driver), then the answer to your question is yes, you can install just the driver without the rest of CUDA.

The CUDA toolkit runfile installer (not accessed via .deb or package manager methods!) has prompts that allow you to deselect the toolkit and SDK/samples installation but say yes to the bundled driver. If you do this (say, after a kernel update, and assuming the kernel update also included the necessary headers), the runfile installer should re-install the driver, which will recompile the kernel module interface for your new kernel. Your statement that the driver packaged with the CUDA toolkit only works for a particular kernel version refers, I think, to the kernel that was installed when the toolkit installer was run (at least with respect to any driver installed via the runfile installer). It is correct that updating the kernel after installing the driver will break things, but that same driver can be made to work if you re-run the runfile installer, as it will recompile the kernel interface.
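Roughly, that driver-only re-install after a kernel update would look something like this (the runfile name is just the CUDA 7.0 example from this thread, and the exact prompts and option names vary between toolkit versions, so check your installer’s help output first):

    # make sure the headers for the newly booted kernel are present
    sudo apt-get install linux-headers-$(uname -r)

    # re-run the same CUDA runfile, answering "yes" only to the driver prompt
    # and "no" to the toolkit and samples prompts
    sudo sh cuda_7.0.28_linux.run

Some versions of the installer also accept non-interactive options for a driver-only install (something like a silent plus driver-only pair of switches), but verify the exact spelling against your installer’s help text before relying on it.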

Even if you didn’t want to use the particular driver (e.g. 352.39) that came bundled with a CUDA toolkit (runfile) installer, you can download a newer driver runfile installer from www.nvidia.com.
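A standalone driver runfile is installed the same way, for example (the filename depends on the driver version you actually download; 352.39 is just the version mentioned above):

    sudo sh NVIDIA-Linux-x86_64-352.39.run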

And to repeat: do not mix the package manager and runfile installer methods. Your problem is probably also solvable using a package manager approach, but I’ve chosen to limit my comments to runfile installer methods above to avoid confusion, to respond to your direct question, and because that is what I am most familiar with.

You may want to read the CUDA toolkit installation guide, as the package manager/runfile pitfalls are covered there:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract

Hi txbob,

Thanks, that’s exactly what I wanted to know.

And you’re correct: I didn’t explicitly state it in my post since it was dragging on already, but before changing to the runfile method of installation, I completely uninstalled everything from the .deb method. That included the nvidia drivers provided by the Ubuntu repositories, CUDA provided by the PPA, and the PPA itself. The only exception is that two of my earlier attempts used nvidia drivers from Ubuntu’s repositories (two different versions) with the runfile installation of CUDA, with similarly terrible results.
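For anyone following along, the cleanup was roughly along these lines (the exact package names and the repository file under /etc/apt/sources.list.d/ will differ depending on how CUDA was added, so treat this as a sketch rather than a recipe):

    sudo apt-get --purge remove "nvidia-*"
    sudo apt-get --purge remove "cuda-*"
    sudo apt-get autoremove
    # remove the CUDA repository entry from /etc/apt/sources.list.d/, then
    sudo apt-get update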

In addition, you’re also correct that when I referred to the driver not working with an updated kernel, I meant the binary nvidia kernel module itself (which of course hasn’t been built for the updated kernel), not the driver version. My intention is just as you described: to reinstall the same driver version from the same CUDA runfile so that it is simply recompiled against the updated kernel headers.
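As a quick sanity check after each reinstall, I plan to confirm that the module that loaded actually matches the running kernel and still reports the same driver version, with something like:

    uname -r                          # kernel that is currently running
    cat /proc/driver/nvidia/version   # version reported by the loaded nvidia module
    nvidia-smi                        # should list the compute GPUs without errors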

In Ubuntu 12.04, I was able to get this working using Ubuntu’s nvidia drivers + the CUDA .deb (PPA) installation AND using Ubuntu’s nvidia drivers + the CUDA runfile installation. Obviously never at the same time. I had never had to resort to installing the nvidia drivers directly from nvidia. It’s not something that I’m a huge fan of since installing core components of a Linux distro from third parties is generally much more of a headache than it’s worth (IMO).

Just in case anyone comes along with a similar issue and is interested, I’ll say that I’m reasonably convinced there is no “simple” way of fixing my problem using the package manager approach. I tried every version of the nvidia drivers from the Ubuntu repository combined with CUDA 6.5 from the PPA (.deb approach), and again every version combined with CUDA 7.0 from the PPA. Since running the CUDA-accelerated software completely locked up the system, a failure in the nvidia driver seems like the most likely cause, as it is the component closest to the kernel. However, the simple fact of the matter is that the issue could have been caused by many things: the particular version of Gromacs, the particular model of GPU, a bug in the version of GCC I was using, the kernel, perhaps a shell issue, possibly some permission issue… among many other possibilities.