OpenGL context cannot be created after Nvidia driver update, cannot reboot the system

This problem happens every time the Nvidia driver is updated to a new version. After the update, when I try to reboot the system, ksmserver (via Qt) tries to create an OpenGL context to render the reboot/logout UI, fails, and crashes. As a result, the only way to reboot the system is to kill the X server, which can have detrimental consequences for the running GUI applications.

There is this KDE bug about this problem:

https://bugs.kde.org/show_bug.cgi?id=364593

There, the KDE developer suggests that the Nvidia driver should not report that it supports OpenGL if it can’t create OpenGL contexts. That way Qt would be able to use software rendering instead.

This is a long-standing problem with KDE and Nvidia drivers, please fix.

Update:

There is a Qt bug here:

https://bugreports.qt.io/browse/QTBUG-67535


You’re kind of taking away the files the driver needs while it’s running. I don’t know if this is really a bug or just expected behavior.

Anyway, you can use a command to safely log out of KDE without displaying the box:

qdbus org.kde.ksmserver /KSMServer logout 0 0 0

or if you want to reboot:

qdbus org.kde.ksmserver /KSMServer logout 0 1 0

You can create aliases like “kde-logout” or “kde-reboot” so you don’t need to remember the magic numbers.
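
A minimal sketch of such shortcuts, written as shell functions rather than aliases so that they also work in non-interactive scripts. The names kde_logout and kde_reboot are only illustrative; the qdbus calls are exactly the two shown above.

```shell
# Log out of the KDE session without showing the confirmation dialog.
kde_logout() {
    qdbus org.kde.ksmserver /KSMServer logout 0 0 0
}

# Reboot from the KDE session without showing the confirmation dialog.
kde_reboot() {
    qdbus org.kde.ksmserver /KSMServer logout 0 1 0
}
```

Putting these in a shell startup file such as ~/.bashrc should make them available in every new terminal.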

This problem doesn’t exist with other drivers, e.g. from Intel, so at least we know it can be made to work.

And the source of the problem is not that I removed some of the driver’s files. By updating the driver, I replaced them with new versions. AFAIU, the driver breaks because it contains an internal ABI check, which I think basically verifies that the kernel driver has exactly the same version as the userspace components. The problem is that this check happens too late.
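
As a rough illustration of the mismatch described above, the following sketch compares the version of the currently loaded kernel module with the version of the module installed on disk. It assumes that the loaded module exposes its version in /sys/module/nvidia/version and that modinfo can find the installed module; the function names are hypothetical, and this is not the check the driver itself performs.

```shell
# Hypothetical helper: succeeds (exit 0) only when both version strings
# are non-empty and differ from each other.
nvidia_versions_differ() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Hypothetical wrapper: succeeds when the loaded module and the module
# installed on disk report different versions, i.e. an update is pending.
# Assumes /sys/module/nvidia/version exists while the module is loaded.
nvidia_reboot_needed() {
    loaded=$(cat /sys/module/nvidia/version 2>/dev/null)
    installed=$(modinfo -F version nvidia 2>/dev/null)
    nvidia_versions_differ "$loaded" "$installed"
}
```

Running `nvidia_reboot_needed && echo "driver updated, reboot pending"` after an update would then give an early warning, before any application trips over the mismatch.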

This is not a solution.

The onus for this is partly on the user. It’s your responsibility to shut down and unload the nvidia driver before updating it. If you pulled this stunt on Windows the driver would crash just the same. In Linux you have more control, but just because you can do something doesn’t mean you should.

I don’t understand Christoph’s response to this on the kde bug, either. glXCreateContext is the first GLX interaction with the driver. It can’t “report” that it’s working before that, and there are no other mechanisms to detect the driver’s status, despite their claims. If it fails, it fails there. The API says it can fail, so I don’t get why they aren’t more robust in handling the error.

The onus for this is partly on the user. It’s your responsibility to shut down and unload the nvidia driver before updating it.

I disagree. I suppose you could argue this is a question of software quality, be that the Nvidia driver, the installer/packaging software, or KDE/Qt. But asking people to manually terminate their GUI session to update the driver is unreasonable in 2018. Some distros don’t even provide an easy way to boot into a console session; an average desktop user shouldn’t be expected to hack his system settings just to update the driver.

If you pulled this stunt on Windows the driver would crash just the same.

AFAIK, on Windows you can freely update Nvidia drivers from GUI without crashes. There isn’t any other way, even. Since Vista, it doesn’t even ask you to reboot and reloads the new driver on the running system.

I don’t understand Christoph’s response to this on the kde bug, either. glXCreateContext is the first GLX interaction with the driver.

I’m not familiar with OpenGL programming, but if that’s the case then the problem needs to be fixed in Qt. I would like to hear from an Nvidia representative on this problem; maybe they will simply remove the ABI check and this will fix the crash.

Windows changes the graphics driver to a fallback before replacing the current one, then switches back. In our case, that would be the responsibility of the package manager, but we don’t have that kind of integration. The driver could try to replace its kernel module when updated, but what if the kernel changed during the package update?

You can’t just remove the ABI check, nothing would work. You could keep a small stub that’s ABI-compatible with just the necessities, and that’s basically what the driver does to report an error on context creation instead of just crashing.

Or Qt could check the return value and fall back cleanly to a software output instead of just looking for a running GLX driver and assuming it’ll work. If the driver tried to pull its GLX driver out of memory on an ABI mismatch, all the programs still running from the dynamic loader would suddenly crash. For all the stupid things the nvidia driver does, I don’t think it’s to blame for this one.

The driver could try to replace its kernel module when updated, but what if the kernel changed during the package update?

The loaded kernel is never changed until reboot. DKMS will (or should) build the kernel modules for every installed kernel, so updating the kernel is not a problem. The problem is that the currently loaded kernel module is not reloaded. I assume it cannot be reloaded because that would mean closing any running GUI applications.

You can’t just remove the ABI check, nothing would work. You could keep a small stub that’s ABI-compatible with just the necessities, and that’s basically what the driver does to report an error on context creation instead of just crashing.

As I said, this works with drivers from other vendors. Whether they maintain a stable ABI or have some other solution - I don’t know.

I suspect that most of the time the ABI is actually compatible between different versions of the Nvidia driver, so at least minor updates could be made to work relatively easily, IMHO. That still leaves the question of what to do when the ABI has to change.

If the driver tried to pull its GLX driver out of memory on an ABI mismatch, all the programs still running from the dynamic loader would suddenly crash.

I’m not asking for this. I don’t even think this is possible. Basically, I see two ways to solve this:

  • Make the userspace components ABI-compatible with the kernel driver regardless of the version and remove the ABI check. I think this is the approach taken by the Linux kernel and other driver vendors.

  • Make sure the current session works with the old userspace components until reboot. I’m not sure exactly how to do this. Maybe make the necessary OpenGL libraries hardlinks, which get updated on reboot. Or use some sort of system daemon that keeps the old libraries from being deleted until reboot.

Other vendors have a much more stable ABI because they keep more of it in userspace. When the ABI does change, Mesa can fall back to a software context.

All running applications will work fine, since they’ve already loaded the libraries. But how do you instruct newly opened applications which libraries they’re supposed to use? These types of solutions make things more convoluted. Can’t we just have Qt say “fall back to software” when it gets a NULL context, instead of relying on Mesa’s undefined behavior and saying “quit”? Let’s not enable lazy developers who don’t want to handle errors gracefully.

But how do you instruct newly opened applications which libraries they’re supposed to use?

If the libraries are hardlinks, new applications would keep using the old libraries until the hardlinks are updated, which would happen on reboot, whether by a startup script or a running daemon.
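
The hardlink behaviour this idea relies on can be sketched in a temporary directory. The file names are made up and plain text files stand in for the libraries; the sketch only shows that a second hardlink keeps pointing at the old contents after the original path is replaced by rename, which is how an updater might install the new file. It does not show how the driver package would actually lay out its files.

```shell
# Demonstrates that a hardlink still resolves to the old file contents
# after the original path has been replaced with a new file.
demo_hardlink_survives_update() {
    dir=$(mktemp -d) || return 1

    # "Old" library, plus a second hardlink that the session would use.
    echo "old driver library" > "$dir/libexample.so"
    ln "$dir/libexample.so" "$dir/libexample.so.session"

    # Simulated update: write the new file, then rename it into place.
    echo "new driver library" > "$dir/libexample.so.new"
    mv -f "$dir/libexample.so.new" "$dir/libexample.so"

    # The session link still refers to the old inode.
    cat "$dir/libexample.so.session"

    rm -rf "$dir"
}
```

Calling demo_hardlink_survives_update prints the old contents, while the updated path holds the new file until the link itself is recreated.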

Note that I’m not arguing that KDE/Qt is doing the right thing. Crashing the application, especially the one that the user is supposed to use to reboot the system after an update, is not the right thing to do. That’s why I initially created a KDE bug and now also a Qt bug (https://bugreports.qt.io/browse/QTBUG-67535).

The reason I think this is worth fixing on the driver side is that (a) the rest of the world works differently and better (i.e. doesn’t crash) and (b) keeping OpenGL working after a driver update seems like a reasonable and useful thing to have, if only to avoid the inconsistencies and slowdowns that come from switching to software rendering.