Ok, this one was kind of difficult to fix so I’ll post what I did here in case somebody ever benefits from this one day.
TL;DR it was a combination of DKMS trouble, the 390.87 nvidia modules (incorrectly) being in my initramfs, and 396.37 seemingly not capable of supporting this rig.
() Boot failure – boot hangs at “Starting Switch Root” ()
After running the CUDA installer (and also asking it to install the graphics driver), the boot would hang here for me. It’s important to understand that
i) I had already disabled nouveau
ii) I had already installed the driver the website suggested to me for this card / os: 390.87
So what you should do in this scenario is hit
ctrl + alt + F2. It may act up a little bit (for me the keyboard was really unresponsive, likely a hardware-specific thing), but eventually what you need to do is log in on the failed boot.
If ctrl + alt + F2 does nothing, wait a little longer and then try that key combo again. You should be far enough along at this point to be able to get a terminal session.
I logged in as root, and the next useful command for you is to find out what went wrong:
In my case, I had the following error message a few pages down:
NVRM: API mismatch: the client has the version 396.37, but
NVRM: this kernel module has the version 390.87. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version
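If you just want the two version numbers out of a message like that, here is a minimal sketch. The helper name nvrm_versions is mine (hypothetical), and it assumes the message wording matches what is quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the two
# versions named in an "NVRM: API mismatch" message, one per line.
# Assumes the message wording quoted in this post.
nvrm_versions() {
    sed -n \
        -e 's/.*the client has the version \([0-9.]*[0-9]\).*/client \1/p' \
        -e 's/.*kernel module has the version \([0-9.]*[0-9]\).*/module \1/p'
}
```

On a systemd box you could feed it the kernel messages from the current boot with journalctl -k -b | nvrm_versions, or use dmesg | nvrm_versions instead. "client" is the userspace library side, "module" is the kernel module that actually got loaded.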
() What this error actually means ()
When you search online you’ll see this show up a few different places. Most of the solutions online are “oh I just reinstalled the
nvidia-dkms package and it worked”. That is insufficient for my situation: I don’t use my package manager to install the graphics driver. I always download and manually install whichever driver NVIDIA tells me is the latest for my card / OS.
What’s actually going on here is that I had installed the DKMS module for the 390.87 driver back when I installed that driver, before installing CUDA. So what NVRM: API mismatch is saying is “hey, the kernel module I have (built by DKMS in this case, but it could also be akmod if you do that) is for v390.87, but I’m finding libraries for 396.37”.
It’s quite literally an API mismatch :p
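To make the "module vs. libraries" idea concrete, here is a rough sketch that compares the two sides. The function names are made up for illustration, and the library path in the usage note is an assumption about where the .run installer puts things on a 64-bit Fedora install.

```shell
#!/bin/sh
# Hypothetical helper: print the version suffix of a fully versioned NVIDIA
# library file name, e.g. libcuda.so.396.37 -> 396.37.
# Soname links such as libcuda.so.1 print nothing.
lib_version() {
    basename "$1" | sed -n 's/^lib.*\.so\.\([0-9][0-9]*\.[0-9][0-9.]*\)$/\1/p'
}

# Hypothetical helper: compare a kernel module version ($1) against a
# userspace library version ($2); returns non-zero on a mismatch.
check_versions() {
    if [ "$1" = "$2" ]; then
        echo "ok: module and libraries are both $1"
    else
        echo "mismatch: module $1, libraries $2"
        return 1
    fi
}
```

Something like check_versions "$(modinfo -F version nvidia)" "$(lib_version /usr/lib64/libcuda.so.396.37)" would then tell you whether the two agree. Keep in mind that modinfo reads the module on disk for the running kernel; the module that was actually loaded (possibly from the initramfs) is what cat /proc/driver/nvidia/version reports.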
() Initial Attempt at Solving ()
So I re-installed the 390.87 graphics driver to get a GUI back and do some research. What you are supposed to do is remove this DKMS module. However, you will also very likely need to rebuild the initramfs for the kernel.
# find out which kernel modules DKMS has built / installed
$ dkms status
nvidia, 390.87, 4.17.17-100.fc27.x86_64, x86_64: installed
# remove that kernel module _for all kernels_
$ dkms remove nvidia/390.87 --all
... a lot of scary output...
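If you have gone through a few driver versions like I did, a small dry-run helper can list every nvidia version DKMS still knows about and print the matching remove commands. This is only a sketch: the function names are hypothetical, and it assumes dkms status prints either the "nvidia, 390.87, ..." layout shown above or the "nvidia/390.87, ..." layout that some newer dkms releases use.

```shell
#!/bin/sh
# Hypothetical helper: read `dkms status` output on stdin and print each
# distinct nvidia driver version once.
nvidia_dkms_versions() {
    sed -n 's|^nvidia[,/] *\([0-9][0-9.]*\)[,:].*|\1|p' | sort -u
}

# Hypothetical helper: print (do not run) one remove command per version,
# so you can read them before doing anything destructive.
dkms_remove_commands() {
    nvidia_dkms_versions | while read -r ver; do
        printf 'dkms remove nvidia/%s --all\n' "$ver"
    done
}
```

Run dkms status | dkms_remove_commands, look over what it prints, and then run those lines as root if they are what you expect.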
Now in theory that should have been enough. I rebooted to runlevel 3, ran
nvidia-uninstall, extracted the 396.37 driver from the CUDA installer (
./cuda_9.2.148_396.37_linux.run --extract /some/absolute/path/under/my/normal/home/directory). In that folder you should have a
NVIDIA-Linux-x86_64-396.37.run graphics driver installer among others.
After running that installer I had the 396.37 driver and DKMS module installed, ran nvidia-xconfig, blah blah (the normal graphics installer steps). Yet I arrived at the same error: on boot it was still complaining that I had a 390.87 kernel module but 396.37 libraries.
() A Theoretically Correct Solution ()
I’m being careful about going through all of my steps here because what I failed to realize is that somehow, even though I removed the 390.87 DKMS module, the 390.87 nvidia modules ended up in my boot image, the initramfs (almost certainly my fault, since I did have issues trying other 390.xx drivers and probably did something stupid).
So reboot for good measure, go to run level 3, and since we just installed 396.37, we need to kill that kernel module now
# verify the kernel module exists
$ dkms status
nvidia, 396.37, 4.17.17-100.fc27.x86_64, x86_64: installed
# say goodbye for _all kernels_
$ dkms remove nvidia/396.37 --all
# uninstall the nvidia graphics drivers
$ nvidia-uninstall
Reboot to run level 3 again for extra good measure (I think this is necessary since we just removed the kernel module, but I don’t actually understand how all this stuff works). I keep stressing run level 3 because that’s all you can do right now (we’ve uninstalled the nvidia graphics drivers, but also disabled nouveau previously, so you don’t have any graphics drivers!).
Now that we’re in this state, it’s our chance to rectify this mistake: rebuild the boot image now that all graphics drivers are gone and no nvidia DKMS modules are installed (check
dkms status to make sure, it should probably show nothing).
# MAKE A BACKUP. We know the 390.87 driver works with this image, so we
# can re-install it if everything fails
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nvidia-390.87.backup.img
# rebuild the image
$ dracut /boot/initramfs-$(uname -r).img
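Since the whole problem was stale nvidia modules hiding in the boot image, it is worth checking the image contents before rebooting. A minimal sketch, assuming the lsinitrd tool that ships with dracut is available; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an initramfs file listing on stdin and succeed
# (exit status 0) if any nvidia kernel module file shows up in it.
listing_has_nvidia() {
    grep -Eq '(^|/)nvidia[^/]*\.ko'
}
```

For example: lsinitrd /boot/initramfs-$(uname -r).img | listing_has_nvidia && echo "nvidia modules are still in this image". Running the same check against the backup image should show the old modules that caused the mismatch.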
Almost there. We just made a new boot image, so of course – reboot again, to run level 3. At this point, you should just be able to run the 396.37 driver install and it should be good to go. I prefer to keep the NVIDIA*.run graphics driver hanging around on my computer in case things break, and it’s also not clear to me if running it through the CUDA installer actually builds the DKMS module.
() Strange Issues with 396.37 ()
The combination of Fedora 27, Kernel 4.17.17-100.fc27.x86_64, and GNOME 3.26.2 resulted in a somewhat amusing effect with my 750 Ti. The load screen showed up (WOOT!), I logged in, but then about every 10 seconds the background screen would change. Start: my background image. 10 seconds later: pure blue. 10 seconds later: background image. Then blue. Etc.
I could only use the mouse when it was the pure blue screen, but clicking on anything, trying to launch a terminal,
ctrl+alt+F2, super button for activities, etc, nothing actually worked (maybe it was the keyboard though, given there were weird mouse problems).
Anyway, since I don’t get an official download link to a graphics driver that CUDA 9.2 needs, I searched around. Negativo17 is currently using 396.54, so I just snagged that.
$ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/396.54/NVIDIA-Linux-x86_64-396.54.run
I went through the mantra to uninstall the DKMS module for 396.37,
nvidia-uninstall, etc. Reboot to run level 3, install the 396.54 driver, and because I was feeling extra lucky I went ahead and just installed CUDA as well (skipping the graphics driver of course!).
Everything appears to be working – UI works just fine, I can compile / run the samples, etc.
Hopefully somebody will benefit from this one day. I’m hopeful that NVIDIA will officially release a 396.xx driver for my 750 Ti card on Linux. I can’t help but feel that other users were also impacted by the driver bundled with the CUDA 9.2 installer.