Starting XWayland results in loading the nvidia kernel module and powering on the GPU

Environment

  • Arch Linux
  • Linux 5.4.1
  • Nvidia 440.36
  • Xorg/XWayland 1.20.6
  • Bumblebee from git: 7aa457f
  • Window manager: Sway 1.2 (Wayland/wlroots)
  • Dell XPS 15 9560 with GTX 1050 and integrated Intel graphics

Background/setup
I use Wayland via sway, and for the most part I need my Nvidia GPU to be powered off: when it is powered on it draws about 7W of extra power, which decimates battery life. I have PCIe power management enabled (because it saves up to 6W as well), so I can’t use bbswitch. I use bumblebee/optirun to run specific apps on the Nvidia GPU from time to time.
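
For context, PCIe runtime power management for a driver-less GPU is typically enabled through sysfs, roughly like this (the PCI address is an assumption and will differ per machine; check lspci):

# assumption: 0000:01:00.0 is the discrete GPU; as root, let the PCI core runtime-suspend it when no driver is bound
echo auto > /sys/bus/pci/devices/0000:01:00.0/power/control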

Bumblebee recently added support for disabling all of its own power management methods (bbswitch, vga_switcheroo), together with an option to always unload the nvidia kernel module when optirun finishes running. This allows the kernel to power off the Nvidia GPU and enter PCIe low power states, while preserving the ability to use the Nvidia GPU via optirun.

See https://github.com/Bumblebee-Project/Bumblebee/pull/983

Configuration
Bumblebee

[driver-nvidia]
KernelDriver=nvidia
# Don't use bbswitch or anything else
PMMethod=none
# Unload Nvidia module after optirun completes to allow the kernel to turn it off and save battery
AlwaysUnloadKernelDriver=true
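
For reference, these options belong in the [driver-nvidia] section of /etc/bumblebee/bumblebee.conf on a standard Arch install, and bumblebeed has to be restarted to pick them up, presumably something like:

# assuming systemd and the Arch bumblebee package
systemctl restart bumblebeed.service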

Xorg: None (I use Wayland; Xorg support is through XWayland, which does not use configuration files)

Modprobe

# Prevent the nvidia modules from loading at boot and powering on the GPU until they are actually needed (saves battery)
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset
# Also prevent nouveau from trying
blacklist nouveau
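
These lines go in a file under /etc/modprobe.d/ (the filename is arbitrary as long as it ends in .conf; blacklist-nvidia.conf would be one example). Note that blacklist only suppresses alias-based autoloading; an explicit modprobe by exact module name still succeeds, which is presumably how the module ends up loaded anyway. To confirm the entries are being picked up:

modprobe --showconfig | grep blacklist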

The problem
Whenever one of the following happens, the nvidia module is loaded and the GPU is powered on unnecessarily:

  • An XWayland process is started
  • An Xorg server is started
  • Steam is started

This results in a large, unnecessary battery drain.
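
The effect is easy to verify: check whether the module is resident and whether the card has left its low-power state (the PCI address below is an assumption; adjust for your machine):

lsmod | grep nvidia
# "suspended" means the GPU is powered down, "active" means it has been powered on
cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status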

I can work around the X/XWayland part of the problem by putting the following in my environment:

__EGL_VENDOR_LIBRARY_FILENAMES="/usr/share/glvnd/egl_vendor.d/50_mesa.json"

Besides being a surprising workaround (and maybe causing other issues?), this still leaves the problem with Steam.
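
Presumably the workaround works because glvnd normally enumerates every ICD JSON in the EGL vendor directory, and restricting it to the Mesa entry keeps libEGL_nvidia from being loaded at all. On a typical install the directory looks roughly like this (contents shown from memory, so treat as illustrative):

ls /usr/share/glvnd/egl_vendor.d/
# 10_nvidia.json  50_mesa.json
cat /usr/share/glvnd/egl_vendor.d/50_mesa.json
# { "file_format_version" : "1.0.0", "ICD" : { "library_path" : "libEGL_mesa.so.0" } }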

Why does enumerating EGL providers result in loading the Nvidia kernel module and powering on the GPU? If the user blacklisted the module and chose not to load it, it seems to me that it should not be loaded unless the user explicitly does so. I assume Steam does something that similarly calls into some library which incidentally loads the kernel module. Again, why? None of these use cases actually use the Nvidia GPU, so there is no need to turn it on.

Is this intended behaviour?

I should mention that this issue is independent of bumblebee.

If you just want to keep the Nvidia driver/GPU off but retain the ability to load it manually, and you have PCIe power management enabled (so no bbswitch), it seems that any X server that starts (or Steam, and maybe other apps) will break the setup, even though they don’t actually use the Nvidia GPU.

I don’t really know why the blacklist file doesn’t work, unless you have some systemd unit circumventing it (nvidia-persistenced enabled, perhaps).
Regarding turning on the Nvidia GPU, I would suspect that Xorg itself is probing all available GPUs unless the AutoAddGPU xorg.conf option is set to false and the device to use (the iGPU) is explicitly specified.
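
A quick way to rule that out, assuming systemd, is to check whether the unit exists and is enabled:

systemctl is-enabled nvidia-persistenced.service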

I believe using the AutoAddGPU option works for Xorg setups, but it doesn’t work for XWayland because XWayland does not read Xorg configuration files.
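
For a plain Xorg setup that would look roughly like the following (the BusID is an assumption for an Intel iGPU at 00:02.0; verify with lspci):

Section "ServerFlags"
    Option "AutoAddGPU" "false"
EndSection

Section "Device"
    Identifier "Intel"
    Driver "modesetting"
    # assumption: iGPU at PCI 00:02.0
    BusID "PCI:0:2:0"
EndSection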

It seems like duplicate functionality to me:

  1. the nvidia module is automatically loaded at boot (unless it’s blacklisted) and is never unloaded
  2. probing GPUs in Xorg also loads the kernel module if it isn’t already loaded... why?

Maybe when Xorg “probes” the available GPUs, the nvidia EGL lib should fail and report that no GPU is available: if the kernel module is not loaded, then the Nvidia GPUs are not “available” (because the user blacklisted the module, or something similar).

Autoloading modules is a kernel feature. It can be disabled either in the kernel config at compile time or at runtime with echo 1 > /proc/sys/kernel/modules_disabled.
Xorg is not restricted to Linux.
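
Worth noting for anyone tempted by that sysctl: it is a one-way switch, and once set no modules can be loaded or unloaded again until reboot:

# as root; cannot be set back to 0 without rebooting
echo 1 > /proc/sys/kernel/modules_disabled
# equivalent form
sysctl -w kernel.modules_disabled=1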

OK, but on Linux, AFAICT, the expectation is that either the kernel modules load automatically, or the user loads the kernel modules they want some other way.

It seems like the nvidia libraries taking it upon themselves to load the kernel modules duplicates built-in kernel functionality (or the user’s manual alternatives, whatever they may be), and in the process creates this problem where I can’t keep the GPU off.

Good point. Maybe this is intended behaviour because otherwise render offload+auto suspend wouldn’t work. IDK.
http://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/primerenderoffload.html
http://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/dynamicpowermanagement.html
One’s joy is the other one’s hell.

I don’t think that’s it. Runtime power management is implemented in the driver itself, i.e. the kernel module stays loaded, but it manages turning off power to the GPU where necessary.

From your link:

(2) Ensure that the nvidia-drm kernel module is loaded. This should normally happen by default, but you can confirm by running `lsmod | grep nvidia-drm` to see if the kernel module is loaded. Run `modprobe nvidia-drm` to load it.

I don’t think that conflicts with the possibility to unload the driver.
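
Indeed, as long as nothing is using the GPU, the stack can be unloaded and reloaded by hand, something like:

# as root; if CUDA has been used, nvidia_uvm has to be removed as well
modprobe -r nvidia_drm nvidia_modeset nvidia
# load the stack again when it is actually needed
modprobe nvidia_drm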

Found the time to reproduce it; overriding a manual blacklist is really annoying behaviour, even more so since it doesn’t seem to be documented anywhere by Nvidia.