(solved) will I be able to use CUDA 9.2 on Fedora 27 with a 750 Ti?

I’m doing a runfile install; the trouble is I can’t actually use the bundled 396.37 driver. In fact, I had a lot of trouble with almost every 390.x driver release; only the latest one (390.87) actually worked for me:

$ nvidia-smi 
Tue Sep  4 20:19:47 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.87                 Driver Version: 390.87                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 42%   34C    P0     2W /  52W |    513MiB /  1999MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1395      G   /usr/libexec/Xorg                             26MiB |
|    0      1460      G   /usr/bin/gnome-shell                          80MiB |
|    0      1833      G   /usr/libexec/Xorg                            174MiB |
|    0      1962      G   /usr/bin/gnome-shell                         209MiB |
+-----------------------------------------------------------------------------+

I’ve never really understood the version numbers on the drivers. What is 396 (emphasis: 6)? Will there be a 64 bit Linux release for this driver that the 750 Ti can use? I really wanted to use CUDA 9.2, so I went the extra mile and installed a newer Fedora.

It’s kind of surprising to see a minor release deprecate an entire family of cards if that’s what actually happened, but I think I’m just misunderstanding what is going on. Maybe there will be a driver soon?

I’d really like to keep Fedora 27. If there will not be a 396 release for my card: 9.1 worked when I had Fedora 25, so can I somehow bypass the “unsupported host compiler” check and all the related errors?

Basically, Fedora 25 is ancient, I was happy to see Fedora 27 support come in, but now it seems my card is legacy and/or won’t be supported. If I must go back to 9.1, is there a “surefire” way to install 9.1 on Fedora 27 (instead of reverting back to Fedora 25), such that I can get nvcc and friends to cooperate with the fact that I now have GCC 7.3.1?

Thank you for any clarity / guidance, ideally on how to get a 396 driver that I can use with my 750 Ti, or if it won’t happen why ;)

396.37 can be made to work with your 750 Ti

Oh this is great to hear! I assumed it would not be because the driver download page for the 750 Ti still lists 390.87.

So, about the 396 driver that comes with the runfile: can I in theory use that, or should I find a way to download it separately? Right now the boot stalls after “starting switch root”.

I’m actively researching how to proceed, just looking for advice on whether I can use the graphics driver that came with the cuda runfile or if I should do something else :)

I assume we are not talking about a laptop here.

The graphics driver that comes with the CUDA 9.2 runfile installer for Fedora 27 should work with GTX 750 Ti.

Start with a clean load of Fedora 27. This assumes you’ll get kernel-devel installed (dnf install kernel-devel) and also get g++ installed (dnf install gcc-c++).
Switch to runlevel 3.
Remove the nouveau driver.
Run the runfile installer.
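
The four steps above could be scripted roughly as follows. This is only a sketch under some assumptions: the helper names (run, prepare_phase1, install_phase2) are made up for illustration, “runlevel 3” is mapped to systemd’s multi-user.target, the nouveau blacklist file follows what the CUDA Linux install guide describes for Fedora, and the runfile name is the one quoted later in this thread. By default it only prints the commands, so nothing is changed until DRY_RUN=0 is set and it is run as root.

```shell
#!/bin/sh
# Sketch only: prints each command by default. Set DRY_RUN=0 and run as
# root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Phase 1: on the freshly installed system. Reboot afterwards.
prepare_phase1() {
    # Kernel headers and the C++ compiler mentioned in the post.
    run dnf install -y kernel-devel gcc-c++
    # Keep nouveau from loading (file path and contents are an assumption
    # taken from the CUDA Linux install guide, Fedora section).
    run sh -c 'printf "blacklist nouveau\noptions nouveau modeset=0\n" > /usr/lib/modprobe.d/blacklist-nouveau.conf'
    # Rebuild the initramfs so the blacklist applies during early boot.
    run dracut --force
    # "Runlevel 3" on systemd: boot to the text-mode target next time.
    run systemctl set-default multi-user.target
}

# Phase 2: after rebooting into text mode.
install_phase2() {
    # Runfile name as quoted in this thread; adjust to your download.
    run sh cuda_9.2.148_396.37_linux.run
    # Back to the GUI ("runlevel 5") on the next boot.
    run systemctl set-default graphical.target
}
```

Calling prepare_phase1, rebooting, and then calling install_phase2 mirrors the order of the list. Splitting it in two is deliberate: the nouveau blacklist only takes effect after a reboot.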

Installation Guide Linux :: CUDA Toolkit Documentation

[url]http://www.nvidia.com/getcuda[/url]

If you want to run the system GUI display off of the GTX 750, then:

  • make sure that is the ONLY gpu in your system (including intel motherboard graphics or anything else) (* of course other scenarios are possible. I’m trying to guide you through the jungle here with the minimal amount of information you have provided)
  • be sure to select “yes” when prompted to install the OpenGL libraries
  • optionally select “yes” when prompted to modify your xorg.conf

Of course you’ll need to go back to runlevel 5 to get the GUI going. Various reboots along the way may be needed.

Don’t have a clean install of the OS and don’t want to reload? Study section 2.7 carefully:

Installation Guide Linux :: CUDA Toolkit Documentation

YMMV

I encourage you to read the entire linux install guide carefully.

Ok, this one was kind of difficult to fix so I’ll post what I did here in case somebody ever benefits from this one day.

TL;DR it was a combination of DKMS trouble, the 390.87 nvidia modules (incorrectly) being in my initramfs, and 396.37 seemingly not capable of supporting this rig.

Boot failure: boot hangs at “Starting Switch Root”

After running the CUDA installer (and also asking it to install the graphics driver), the boot would hang here for me. It’s important to understand that

i) I had already disabled nouveau
ii) I had already installed the driver the website suggested to me for this card / os: 390.87

So what you should do in this scenario is hit ctrl + alt + F2. It may act up a little (for me the keyboard was really unresponsive, likely a hardware-specific thing), but eventually what you need to do is log in on the failed boot.

So if ctrl + alt + F2 does nothing, wait a little longer and then try that key combo again. You should be far enough along at this point to be able to get a terminal session.

I logged in as root. The next useful command shows you what went wrong: journalctl -r (the -r puts the newest entries first).

In my case, I had the following error message a few pages down:

NVRM: API mismatch: the client has the version 396.37, but
NVRM: this kernel module has the version 390.87.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version

What this error actually means

When you search online you’ll see this show up a few different places. Most of the solutions online are “oh I just reinstalled the nvidia-dkms package and it worked”. This is insufficient for my situation, since I don’t use my package manager to install the graphics driver, I always download and install manually whichever one NVIDIA tells me is the latest driver for my card / OS.

What’s actually going on here is that I had installed the DKMS module for the 390.87 driver when I installed that driver, before installing CUDA. So what NVRM: API mismatch is saying is “hey, the kernel module I’ve got (built by dkms in this case, but could also be akmod if you do that) is for v390.87, but I’m finding libraries for 396.37”.

It’s quite literally an API mismatch :p
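
To pull the two version numbers out of the log without paging through it, something like the following might help. It is a minimal sketch: the function name nvrm_mismatch_versions is made up, and it assumes the message has the same wording as the one quoted above (client version first, kernel module version second).

```shell
#!/bin/sh
# Extract the two versions from an "NVRM: API mismatch" message read on
# stdin. The first "has the version" hit is the client (user-space
# libraries), the second is the kernel module.
nvrm_mismatch_versions() {
    grep -oE 'has the version [0-9]+(\.[0-9]+)+' |
        awk 'NR == 1 { print "client libraries: " $4 }
             NR == 2 { print "kernel module:    " $4 }'
}

# Typical use on the affected machine (kernel messages of the current boot):
#   journalctl -k -b | nvrm_mismatch_versions
# The version of the module that is actually loaded is also shown in
#   /proc/driver/nvidia/version
```

If the two numbers differ, the kernel module that got loaded at boot is not the one that belongs to the installed libraries, which is exactly the situation described here.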

Initial Attempt at Solving

So I re-installed the 390.87 graphics driver to get a GUI back and do some research. What you are supposed to do is remove this module. However, you will also very likely need to rebuild the initramfs for the kernel.

# find out which kernel modules DKMS has registered (not which are loaded)
$ dkms status
nvidia, 390.87, 4.17.17-100.fc27.x86_64, x86_64: installed

# remove that kernel module _for all kernels_
$ dkms remove nvidia/390.87 --all
... a lot of scary output...
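
If several kernels are installed, dkms status can list the same driver more than once. A small sketch for turning that output into removal commands is below. The function names are made up, and the parsing assumes the “nvidia, 390.87, …” layout shown above (the “nvidia/390.87, …” layout is also accepted, on the assumption that some dkms releases print it that way). It only prints the commands so they can be checked before running anything.

```shell
#!/bin/sh
# Print each distinct nvidia version found in `dkms status` output (stdin).
dkms_nvidia_versions() {
    awk -F'[,/]' '$1 == "nvidia" { gsub(/[ \t]/, "", $2); print $2 }' | sort -u
}

# Print (not run) the matching removal commands for review.
dkms_nvidia_remove_cmds() {
    dkms_nvidia_versions | while read -r ver; do
        printf 'dkms remove nvidia/%s --all\n' "$ver"
    done
}

# Usage: dkms status | dkms_nvidia_remove_cmds
```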

Now in theory that should have been enough. I rebooted to runlevel 3, ran nvidia-uninstall, and extracted the 396.37 driver from the CUDA installer (./cuda_9.2.148_396.37_linux.run --extract /some/absolute/path/under/my/normal/home/directory). In that folder you should have an NVIDIA-Linux-x86_64-396.37.run graphics driver installer among others.

After this, I installed the 396.37 driver and its DKMS module, ran nvidia-xconfig, blah blah (the normal graphics installer steps). Yet I arrived at the same error: on boot it was still complaining that I had a 390.87 kernel module but 396.37 libraries.

A Theoretically Correct Solution

I’m being careful about going through all of my steps here because what I failed to realize is that somehow, even though I removed the 390.87 DKMS module, this stuff ended up in my boot image (almost certainly my fault, since I did have issues trying other 390.xx drivers and probably did something stupid).

So reboot for good measure, go to run level 3, and since we just installed 396.37, we need to kill that kernel module now

# verify the kernel module exists
$ dkms status
nvidia, 396.37, 4.17.17-100.fc27.x86_64, x86_64: installed

# say goodbye for _all kernels_
$ dkms remove nvidia/396.37 --all

# uninstall the nvidia graphics drivers
$ nvidia-uninstall

Reboot to run level 3 again for extra good measure (I think this is necessary since we just removed the kernel module, but I don’t actually understand how all this stuff works). I stress run level 3 because that’s all you can do right now (we’ve uninstalled the nvidia graphics drivers, but also disabled nouveau previously, so you don’t have any graphics drivers!).

Now that we’re in this state, it’s our chance to rectify this mistake: rebuild the boot image now that all graphics drivers are gone and no nvidia modules are registered with dkms (check dkms status to make sure, it should probably show nothing).

# MAKE A BACKUP.  We know the 390.87 driver works with this image, so we
# can re-install it if everything fails
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nvidia-390.87.backup.img

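# rebuild the image for the running kernel (the old file was moved away
# above, so dracut will not refuse to overwrite it)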
$ dracut /boot/initramfs-$(uname -r).img
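
One way to confirm that the rebuilt image really is free of stale nvidia modules is to list its contents and look for them. The sketch below assumes lsinitrd (it ships with dracut) is available; the function name initramfs_nvidia_modules is made up, and the pattern simply looks for file names containing “nvidia” and ending in .ko, optionally with a compression suffix.

```shell
#!/bin/sh
# Print any nvidia kernel-module paths found in an initramfs file listing
# read from stdin. Exit status is non-zero when none are found.
initramfs_nvidia_modules() {
    grep -oE '[^ ]*nvidia[^/ ]*\.ko[^ ]*'
}

# Usage (as root):
#   lsinitrd "/boot/initramfs-$(uname -r).img" | initramfs_nvidia_modules
```

No output (and a non-zero exit status) is what you want to see at this point. Running the same check against the backup image would show whether the 390.87 modules had been baked into it.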

Almost there. We just made a new boot image, so of course – reboot again, to run level 3. At this point, you should just be able to run the 396.37 driver install and it should be good to go. I prefer to keep the NVIDIA*.run graphics driver hanging around on my computer in case things break, and it’s also not clear to me if running it through the CUDA installer actually builds the DKMS module.

YMMV.

Strange Issues with 396.37

The combination of Fedora 27, Kernel 4.17.17-100.fc27.x86_64, and GNOME 3.26.2 resulted in a somewhat amusing effect with my 750 Ti. The load screen showed up (WOOT!), I logged in, but then about every 10 seconds the background screen would change. Start: my background image. 10 seconds later: pure blue. 10 seconds later: background image. Then blue. Etc.

I could only use the mouse when it was the pure blue screen, but clicking on anything, trying to launch a terminal, ctrl+alt+F2, super button for activities, etc, nothing actually worked (maybe it was the keyboard though, given there were weird mouse problems).

Anyway, since I don’t get an official download link to a graphics driver that CUDA 9.2 needs, I searched around. Negativo17 is currently using 396.54, so I just snagged that.

$ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/396.54/NVIDIA-Linux-x86_64-396.54.run

I went through the same routine again: remove the DKMS module for 396.37, nvidia-uninstall, etc. Then I rebooted to run level 3, installed the 396.54 driver, and because I was feeling extra lucky I went ahead and installed CUDA as well (skipping the graphics driver of course!).

Everything appears to be working – UI works just fine, I can compile / run the samples, etc.

Hopefully somebody will benefit from this one day. I’m hopeful that NVIDIA will officially release a 396.xx driver for my 750 Ti card on Linux. I can’t help but feel that other users were also impacted by the driver bundled with the CUDA 9.2 installer.

Eeek. I forgot to say, thanks @txbob for providing suggestions / guesses to help get me out of the jungle! They were good sanity checks to double-check on my end, your help is very much appreciated!!! :)

Fedora 27 doesn’t install the 4.17 kernel by default. AFAIK you have to update/upgrade to that.

Yes, the 4.17 kernel has issues with drivers prior to 396.45

[url]https://devtalk.nvidia.com/default/topic/1036332/cuda-setup-and-installation/cuda-9-2-88-on-fedora-28-/post/5274248/#5274248[/url]
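
As a rough pre-flight check, the rule of thumb above can be written down as a small script. Treat it as a sketch with assumptions: the function names are made up, version comparison relies on GNU sort -V, and the rule is only applied to the 396.xx branch, because this thread shows 390.87 working on 4.17.17.

```shell
#!/bin/sh
# True when version $1 sorts strictly before version $2 (GNU `sort -V`).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Return 0 (and print a warning) when a 396.xx driver older than 396.45 is
# paired with a 4.17 or newer kernel.
#   $1 = kernel release as printed by `uname -r`
#   $2 = driver version, e.g. 396.37
driver_kernel_warning() {
    kernel="${1%%-*}"        # "4.17.17-100.fc27.x86_64" -> "4.17.17"
    driver="$2"
    case "$driver" in
        396.*) ;;            # the rule is only applied to this branch
        *) return 1 ;;
    esac
    if ! version_lt "$kernel" 4.17 && version_lt "$driver" 396.45; then
        printf 'driver %s may not work with kernel %s\n' "$driver" "$kernel"
        return 0
    fi
    return 1
}

# Usage: driver_kernel_warning "$(uname -r)" 396.37
```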

There was no mention of 4.17 in your original posting, and I suggested a clean load of Fedora, which should get you the prior kernel 4.13.9. A clean load of the OS also would not have had any scraps of prior drivers around, such as in the initrd image.

The supported/tested kernel for Fedora 27 is listed in the linux install guide:

[url]https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#system-requirements[/url]

All true things. I updated the kernel before installing the graphics driver. More specifically, Fedora upgrades the kernel as part of a normal update, so the first update after install pulled in the new kernel. Unless you blacklist it in dnf.conf, the default is to update the kernel (unlike, say, Ubuntu or other LTS distros, where there’s usually a separate dist-upgrade-like command to specifically upgrade the kernel).

If 396.54 didn’t work I would have done a fresh install next and blacklisted the kernel updates for dnf. But I was setting up CUDA last after setting everything else up on this box – that stuff took a lot longer than this did.

Anyway, I’ll do some more research on what the implications of blacklisting kernel updates in Fedora are. I am under the impression that this prevents me from getting potentially many other updates, but I do not know. This has never been an issue in the past (kernel version not being in the official listing for Fedora). Perhaps I’m updating earlier than I normally do this time around though, and in the past I’ve just gotten lucky and the support for newer kernels was already fixed xD
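
For reference, blacklisting kernel updates in dnf could look roughly like this. It is a sketch, not a recommendation: the function name add_kernel_exclude is made up, it assumes [main] is the only section in the file (so appending at the end lands inside it), and it assumes an exclude=kernel* pattern is the kind of blacklist meant here. As far as I understand it, an exclude only filters packages whose names match the pattern, so other updates should still be offered, though packages that need a newer kernel could be held back as a result.

```shell
#!/bin/sh
# Append an exclude line for kernel packages to a dnf configuration file.
# Assumes the file's only section is [main]. Refuses to touch a file that
# already has an exclude= or excludepkgs= line.
add_kernel_exclude() {
    conf="${1:-/etc/dnf/dnf.conf}"
    if grep -qE '^(exclude|excludepkgs)=' "$conf"; then
        echo "exclude line already present in $conf; edit it by hand" >&2
        return 1
    fi
    printf 'exclude=kernel*\n' >> "$conf"
}

# One-off alternative that does not edit any file:
#   dnf upgrade --exclude='kernel*'
```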

Presumably if I had a more modern card a 396.xx driver would have shown up on the driver download page as well. Definitely not complaining about that… I’ve definitely gotten more than my money’s worth out of this card :)