Warning about CUDA 9.2

Hi all,

This post is to pre-emptively add some institutional knowledge to the web in case others face the same problem.

I had the following configuration:

Ubuntu 18.04
nvidia-driver-396 from Ubuntu
CUDA 9.0
cuDNN 7.1
TensorFlow 1.8

I wanted to upgrade to TensorFlow 1.11, which is built against cuDNN 7.2, which in turn requires CUDA 9.2. I downloaded the CUDA 9.2 (local) (deb) installer. It completely took a crap on my entire system, rendering it unusable. The reason is that the CUDA local deb tries to install its own version of nvidia-396 ON TOP OF Ubuntu’s nvidia-driver-396.
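For anyone trying to tell whether they are in the same state, here is a rough sketch (not something from the original post) that filters a package listing down to NVIDIA driver packages, so that two competing drivers show up side by side. The function name `list_nvidia_driver_packages` is made up for illustration, and the name patterns are assumptions based only on the package names mentioned in this thread; other setups may use different names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "package version" lines on stdin and keeps
# only the ones that look like NVIDIA driver packages.
# The patterns are assumptions taken from the names in this thread:
#   nvidia-driver-NNN  (Ubuntu's package)
#   nvidia-NNN         (the one pulled in by the CUDA local deb)
#   cuda-drivers       (assumed name of NVIDIA's driver meta-package)
list_nvidia_driver_packages() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E '^(nvidia-driver-[0-9]+|nvidia-[0-9]+|cuda-drivers)[[:space:]]' || true
}

# Usage on a real system (not executed here):
#   dpkg-query -W -f='${Package} ${Version}\n' | list_nvidia_driver_packages
```

If both an `nvidia-driver-396` line and an `nvidia-396` line come back, that would be consistent with the conflict described above.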

I’m posting this here in the hope that if anyone has similar issues they’ll be able to, in their frantic Googling, find this post, because I didn’t find the answer in my Googling.

The answer is:
DO NOT DO NOT DO NOT DO NOT DO NOT do this as the nvidia docs say:
sudo apt-get install cuda-9-2
Instead do this:
sudo apt-get install cuda-toolkit-9-2
And you should be fine.
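If you want to check before committing, a dry run can show whether a given package would touch the driver. This is only a sketch under assumptions: `driver_changes_in_plan` is a made-up helper name, and the prefixes it looks for (`nvidia-`, `libnvidia-`, `cuda-drivers`) are guesses at what driver-related packages are called. It relies on `apt-get -s` (simulate), which prints planned actions as lines starting with `Inst`, `Remv`, or `Conf` without changing anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the output of `apt-get -s install ...` on
# stdin and prints only the planned installs/removals that look
# driver-related. The package-name prefixes are assumptions.
driver_changes_in_plan() {
    # "|| true" keeps the exit status at 0 when the plan has no such lines.
    grep -E '^(Inst|Remv) (nvidia-|libnvidia-|cuda-drivers)' || true
}

# Usage on a real system (simulation only, nothing gets installed):
#   apt-get -s install cuda-9-2         | driver_changes_in_plan
#   apt-get -s install cuda-toolkit-9-2 | driver_changes_in_plan
```

Empty output for a package suggests it would leave the installed driver alone; any `Inst` or `Remv` line naming a driver package is a reason to stop and look closer.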

I wish there were an “apt-get install nvidia-fix-this-shit” package that would figure these things out. We can drive cars with NVIDIA autonomously but we can’t install their CUDA drivers autonomously. -_____-

Please note that per NVIDIA’s support matrix CUDA 9.x does not support Ubuntu 18.04. CUDA 10 does.

Sure, I understand, but it is necessary to force-install it, against NVIDIA’s support matrix, for the vast majority of people using TensorFlow. My setup is probably one of the most common out there.

TensorFlow 1.11.0 was released 4 days ago, built against cuDNN 7.2:
https://github.com/tensorflow/tensorflow/releases

cuDNN 7.2 is only available for CUDA 9.2:
https://developer.nvidia.com/rdp/cudnn-archive

CUDA 10 isn’t an option at the moment, and it isn’t my fault. Anyone who wants to use the latest TensorFlow on 18.04 without going through compilation hell needs to force install CUDA 9.2 on their system and will likely stumble upon the same issue.

Just posting this on here as a piece of information for those who want to get on with their lives in doing deep learning instead of figuring out how to install drivers ;) This wasn’t an issue with CUDA 9.0 but is with 9.2; with the recent TensorFlow 1.11 release 4 days ago I expect there will be a few thousand person-hours of AI research power that will be wasted on the same driver issue.

There is a famous song for this situation (find it on YouTube if you are so inclined): “You can’t always get what you want”.

There is no point in complaining that something does not work if the vendor says that such usage is not supported. “not supported” means: We haven’t tried it, it probably won’t work, don’t come crying to us when it doesn’t.

Are you being forced to use Ubuntu 18.04? If so, I’d be interested to learn by whom.

I’m not complaining. It works! I’m just posting information about how to get an unsupported situation to work, since this is a common scenario.

A lot of people have been using TF 1.8 + CUDA 9.0 on Ubuntu 18.04 which is also unsupported, but exceedingly common, and installation goes mostly without a hitch. There is this one hitch for TF 1.11 + CUDA 9.2 that I posted about. That’s it.

At the end of the day I need to get back to training my models. Not being told that my career is unsupported by some matrix. So I’ll make it work, somehow, even if it’s unsupported. Hence, see above. That’s how I have always worked and how most researchers work. We make square pegs fit in round holes and then publish how to do it.

(Of course, if NVIDIA wants to listen to us and support this common configuration, or Google wants to listen to us and support CUDA 10, that would be super awesome, but I don’t have time to research models AND deal with convincing those 2 companies to cooperate.)

Sharing information on workarounds is cool, of course. I took the following as a superfluous complaint, though:

Serious question: How would research be impeded if you were to use Ubuntu 17.04 with CUDA 9.2 (a supported configuration)?

Point taken, I was a bit frustrated, sorry.

17.04: I’m hesitant to use non-LTS releases since we don’t use non-LTS on production servers and robots, and I want my workstation’s software stack to be one that is plausibly deployable in the future. So it’s either 16.04 or 18.04.

I picked 18.04 because I have a few other software packages that are either dropping support for 16.04, not supporting 16.04 at all, or planning to EOL support for 16.04 soon. ROS Melodic, for one, but there are others. I also use GIMP 2.10, which is a hundred times easier to use for fixing some annotations than previous releases of GIMP, and that isn’t supported on 16.04.