Dual GPU Intel-Nvidia / Prime Render Offloading / Ubuntu 20.04 -- does not offload

Hello,

I hope this will also help other folks in a similar situation. I have a custom-built desktop (not a laptop!) with both an Intel iGPU and an Nvidia GTX 1080 as dGPU, running Ubuntu 20.04.

Context
The primary goal for the Nvidia card was to run CUDA programs such as Hashcat, while still using the Intel iGPU for display and everything else (and keeping power draw to a bare minimum).

Hashcat works successfully on the Nvidia card; however, PRIME On-Demand / render offloading as such does not seem to work.
I’ve tinkered around for hours and would appreciate it if you could shed some light on what’s wrong.
Note: this server is mainly “headless” but still set up as a desktop. I mostly access it remotely, e.g. with NoMachine (I plugged in a dummy DVI adapter to make things easy).

Current install

  • nvidia proprietary drivers installed (nvidia-headless, version 460)
  • Xorg files created / modified in such a way that the Screen is associated to the iGPU with “modesetting” as driver
  • Nouveau blacklisted
  • PRIME set to On-demand
  • See commands output below and attached nvidia-debug file
  • I also disabled gpu-manager (by adding nogpumanager to Grub) so that /usr/share/X11/xorg.conf.d/11-nvidia-prime.conf doesn’t get overwritten

$ sudo lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)

$ nvidia-smi
Tue Jan 26 21:26:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 23%   34C    P0    34W / 180W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ prime-select query
on-demand

Issue

  • I only have 1 provider
  • When I “force” Render offloading, it does not work, see below with glxinfo
  • In idle mode, power draw is always around 35W (as you can see in nvidia-smi). While I understand that (for now?) GTX cards cannot benefit from proper power management, why is this so high? Is there any way to reduce it or turn the card off (without having to force Prime-Intel)?

$ xrandr --listproviders
Providers: number : 1
Provider 0: id: 0x45 cap: 0x0 crtcs: 3 outputs: 3 associated providers: 0 name:modesetting

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -iE 'OpenGL renderer|vendor'
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa/X.org (0xffffffff)
OpenGL vendor string: Mesa/X.org
OpenGL renderer string: llvmpipe (LLVM 11.0.0, 256 bits)

$ glxinfo | grep -iE 'OpenGL renderer|vendor'
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Mesa/X.org (0xffffffff)
OpenGL vendor string: Mesa/X.org
OpenGL renderer string: llvmpipe (LLVM 11.0.0, 256 bits)

Cause / Remediation

  • Could it be because Prime Render Offloading is (still) not available in Xorg ? I’m on Ubuntu 20.04.1, see Xorg version below
  • Misconfiguration ?

Thank you very much !

Xorg config

nvidia-bug-report.log.gz (304.3 KB)

$ cat /etc/X11/xorg.conf.d/10-intel.conf

Section "ServerLayout"
    Identifier    "layout"
    Screen    0   "intelscreen"
    #Screen    1   "nvidiascreen"
    Option    "AllowNVIDIAGPUScreens"
    #Inactive "nvidiadevice"
EndSection

Section "Device"
    Identifier     "inteldevice"
    Driver         "modesetting"
    BusID          "PCI:0:2:0"
    Option	   "AccelMethod"   "sna"
EndSection

Section "Device"
    Identifier    "nvidiadevice"
    Driver        "nvidia"
    BusID         "PCI:1:0:0"
    Option "ConstrainCursor" "off"
    Option        "Coolbits"       "28"
    Option        "AllowEmptyInitialConfiguration"
EndSection

Section "Screen"
    Identifier "intelscreen"
    Device    "inteldevice"
EndSection

Section "Screen"
   Identifier    "nvidiascreen"
   Device        "nvidiadevice"
   Option "IgnoreDisplayDevices" "CRT"
EndSection

$ cat /usr/share/X11/xorg.conf.d/11-nvidia-prime.conf
# DO NOT EDIT. AUTOMATICALLY GENERATED BY gpu-manager

Section "OutputClass"
    Identifier "Nvidia Prime"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowEmptyInitialConfiguration"
    Option "IgnoreDisplayDevices" "CRT"
    #Option "PrimaryGPU" "Yes" <<< commented out by me
    # added nogpumanager to grub otherwise it gets overwritten
    ModulePath "/lib/x86_64-linux-gnu/nvidia/xorg"
EndSection

# added this section
# not sure how it interacts with /etc/X11/xorg.conf.d/

Section "OutputClass"
    Identifier "intel"
    MatchDriver "i915"
    Driver "modesetting"
    #Driver "intel"
    Option "PrimaryGPU" "yes"    
EndSection

It doesn’t look like the NVIDIA driver is installed correctly. The X server can’t find the NVIDIA X driver component (nvidia_drv.so):

Typically that goes in /usr/lib/xorg/modules/drivers/nvidia_drv.so but Ubuntu might move things around (such as putting it in /lib/x86_64-linux-gnu/nvidia/xorg).
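
For anyone wanting to check this on their own machine, here is a minimal sketch that looks for nvidia_drv.so in the two directories named above. The function name find_nvidia_xdriver is made up for illustration, and other distributions may use different paths.

```shell
#!/usr/bin/env bash
# Look for the NVIDIA X driver module in a list of candidate directories.
# Prints each match; returns non-zero if none is found.
# (find_nvidia_xdriver is a hypothetical helper name, not part of any package.)
find_nvidia_xdriver() {
    local dir found=1
    for dir in "$@"; do
        if [ -f "$dir/nvidia_drv.so" ]; then
            printf '%s\n' "$dir/nvidia_drv.so"
            found=0
        fi
    done
    return "$found"
}

# The two locations mentioned in the reply above.
find_nvidia_xdriver \
    /usr/lib/xorg/modules/drivers \
    /lib/x86_64-linux-gnu/nvidia/xorg \
    || echo "nvidia_drv.so not found in the usual locations"
```

On Debian or Ubuntu, `dpkg -S nvidia_drv.so` should additionally report which installed package, if any, ships the file.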

Thanks for the swift reply ! Indeed nvidia_drv.so is in neither directory (or anywhere else for that matter)

Could it be because I installed nvidia-headless-460 which by definition does not need X server related features ?

I’m going to try to install the regular nvidia-driver-460 package on top and see what it’s like

I therefore uninstalled nvidia-headless and installed the full driver package instead, this time from the graphics-drivers PPA (latest version, i.e. 460.39). All my original issues are fixed, but a new one appeared:

  • nvidia_drv.so is now found in /usr/lib/x86_64-linux-gnu/nvidia/xorg (note the /usr prefix)
  • render offloading works ok:

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep -iE 'OpenGL renderer|vendor'
server glx vendor string: NVIDIA Corporation
client glx vendor string: NVIDIA Corporation
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 1080/PCIe/SSE2

$ glxinfo | grep -iE 'OpenGL renderer|vendor'
server glx vendor string: SGI
client glx vendor string: Mesa Project and SGI
Vendor: Intel Open Source Technology Center (0x8086)
OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 4000 (IVB GT2)

  • Average power draw is now down to 5W (see nvidia-smi below)

  • BUT as you can see in nvidia-smi, there are now two Xorg processes listed under the Nvidia GPU.
    How can this be? This was not the case before.
    Any idea how this can be fixed? It seems to defeat the purpose of offloading, if I’m not totally mistaken.

Cheers,

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:01:00.0 Off |                  N/A |
| 28%   28C    P8     5W / 180W |     11MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1141      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2170      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

$ prime-select query
on-demand

New bug-report: nvidia-bug-report.log.gz (328.2 KB)

Are there two X servers running? You can check with pgrep -a Xorg. It’s likely that one of them is the GDM greeter session that’s used for user switching, and the other one is your actual desktop session.
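
To make that check a bit easier to read, the sketch below maps each Xorg process to the user it runs for. It assumes the server was started with an `-auth /run/user/<uid>/...` argument, as GDM-launched servers usually are; xorg_owners is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read `pgrep -a Xorg` output on stdin and print "PID UID" for each server,
# taking the UID from the /run/user/<uid>/ part of the -auth path.
# Assumption: servers without such an -auth argument are silently skipped.
xorg_owners() {
    sed -n 's|^\([0-9][0-9]*\) .*-auth /run/user/\([0-9][0-9]*\)/.*|\1 \2|p'
}

pgrep -a Xorg | xorg_owners | while read -r pid uid; do
    # getent resolves the numeric UID to an account name.
    name=$(getent passwd "$uid" | cut -d: -f1)
    echo "Xorg PID $pid runs for UID $uid (${name:-unknown})"
done
```

If one of the reported accounts is the display manager's own user, that server is most likely the greeter rather than a desktop session.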

It does sound like offloading is working as intended. Please note that while the GPU can drop to its lowest active power state (P8, at 5 Watts as you noted), systems with Pascal or older GPUs such as the GTX 1080 cannot power the GPU off completely in render offload mode. That kind of power saving mode (referred to as “runtime D3” or RTD3) is only supported on laptops with Turing (the 16 and 20 series) and up. Please see Chapter 22. PCI-Express Runtime D3 (RTD3) Power Management in the README for more information.
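
The performance state and power draw mentioned above can be polled directly instead of reading them off the full table. This is a rough sketch using nvidia-smi's CSV query mode; parse_gpu_state is a made-up helper, and it assumes the usual `P8, 5.23 W` style of output line.

```shell
#!/usr/bin/env bash
# Split one CSV line such as "P8, 5.23 W" into performance state and power draw.
# (parse_gpu_state is a hypothetical helper; the ", " separator is assumed.)
parse_gpu_state() {
    local line=$1 pstate power
    pstate=${line%%,*}   # text before the first comma
    power=${line#*, }    # text after the first ", "
    printf 'state=%s power=%s\n' "$pstate" "$power"
}

# Only query when the driver utilities are installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=pstate,power.draw --format=csv,noheader |
        while IFS= read -r line; do parse_gpu_state "$line"; done
fi
```

Running it while the desktop is idle and again while an offloaded application is active should show the state moving between P8 and a higher-performance state.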

  • If I connect through SSH only (bearing in mind there is a “monitor” attached to the machine), I see one Xorg process
  • As soon as I also connect remotely to the graphical session (using NoMachine), a second process shows up:

$ pgrep -a Xorg
1181 /usr/lib/xorg/Xorg vt1 -displayfd 3 -auth /run/user/120/gdm/Xauthority -background none -noreset -keeptty -verbose 3
4225 /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3

  • I understand the power-management limitations of GTX cards (though we can hope they will be supported soon?).
    Still, even though it’s not a huge burden right now, I’d like to understand why Xorg appears to be handled by Nvidia instead of Intel.

There’s something wrong, and beyond my very case, it’s interesting / important to understand I guess.

That nvidia-smi output just means that the Xorg process has the NVIDIA device file open. It doesn’t mean that your whole desktop is being rendered by the NVIDIA GPU. So it’s expected to show up there and it’ll only actually be used by applications you explicitly choose to offload to it (or they select the NVIDIA GPU with the Vulkan graphics API). So power usage should remain low (although not zero) unless an offloaded OpenGL, Vulkan, or EGL application is actually using it.
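
In practice, "explicitly choosing" an application to offload comes down to setting the two environment variables used earlier in this thread for that one process. A small sketch follows; the wrapper name offload is hypothetical.

```shell
#!/usr/bin/env bash
# Run a single command with the render-offload variables set. Everything
# else in the session keeps using the default (Intel) GPU.
# (offload is a hypothetical wrapper name chosen for this example.)
offload() {
    __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}

# Example usage, comparing the renderer with and without offloading:
#   glxinfo | grep 'OpenGL renderer'
#   offload glxinfo | grep 'OpenGL renderer'
```

Because the variables are set only for the wrapped command, they do not leak into the calling shell or into other applications.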

As far as I can tell from what you’ve posted, it’s working as intended.

OK it’s clear.
I guess my issue is solved, and I hope it will help anyone with a dual-GPU machine to set up their nvidia device accordingly.
Therefore this post can be closed. Thanks a lot for your support !

One more thing though.
I feel that tinkering with the Xorg config in such a situation is more trial and error than anything else (playing around with “ServerLayout”, “Screen”, “Device” etc. without a clear indication or understanding of what is actually needed, and why, in an On-Demand setup). The community would be very grateful if the documentation could clarify this.

It works for me right now but I don’t really know why.

With modern X servers on a hybrid laptop where Intel or AMD is the primary GPU device, you actually get a render offload configuration by default if you don’t have any other configuration files to change the behavior. Ubuntu just goes out of its way with gpu-manager to configure things explicitly.

You could probably achieve the same configuration you have now, basically, by just deleting /etc/X11/xorg.conf and /usr/share/X11/xorg.conf.d/11-nvidia-prime.conf and making sure gpu-manager is disabled.
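
Before deleting anything, it may help to see which configuration files actually mention a GPU driver. The sketch below does a plain text search over the locations discussed in this thread; list_gpu_xorg_conf is a made-up helper, and the keyword list is only an assumption about what is relevant.

```shell
#!/usr/bin/env bash
# Print every *.conf file under the given paths that mentions a GPU driver,
# to show which snippets can influence the X server's GPU setup.
# (list_gpu_xorg_conf and its keyword list are illustrative assumptions.)
list_gpu_xorg_conf() {
    local path
    for path in "$@"; do
        [ -e "$path" ] || continue
        # grep exits non-zero when nothing matches; that is not an error here.
        grep -rliE --include='*.conf' 'nvidia|modesetting|intel' "$path" || true
    done | sort
}

list_gpu_xorg_conf /etc/X11/xorg.conf /etc/X11/xorg.conf.d /usr/share/X11/xorg.conf.d
```

Moving the listed files aside (rather than deleting them) makes it easy to restore the previous setup if the default configuration does not behave as expected.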

  • Note it’s not a laptop but a custom built computer where I just added a PCIe GTX 1080 but that should not matter.

  • I’ve just tried deleting/disabling every *.conf file related to nvidia/intel in both directories,
    and I’m back to only 1 provider (…name:modesetting); offloading does not work, just like in my first post
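
A quick way to re-check this after each configuration change is to pull the provider count out of `xrandr --listproviders`. This is a minimal sketch; provider_count is a hypothetical helper, and it assumes the `Providers: number : N` header line shown in the first post.

```shell
#!/usr/bin/env bash
# Extract the provider count from `xrandr --listproviders` output on stdin.
# (provider_count is a hypothetical helper name.)
provider_count() {
    sed -n 's/^Providers: number : \([0-9][0-9]*\).*/\1/p'
}

# Only query when xrandr is installed and an X display is available.
if command -v xrandr >/dev/null 2>&1 && [ -n "${DISPLAY:-}" ]; then
    n=$(xrandr --listproviders | provider_count)
    echo "X server reports ${n:-0} provider(s)"
fi
```

A count of 1 means the X server only sees the modesetting device. As far as I know, the NVIDIA README describes a working render-offload setup as listing an additional provider named NVIDIA-G0.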

So it seems some sort of Xorg config is required - and as stated there does not seem to be any clear approach as to what this should be.
So I’m back to playing around with xorg.conf-like files :-D

  • Another weird thing: at login (gdm3), the screen flickers for 10-15 seconds after entering the password, until the desktop shows up.
    This never happened before installing the nvidia drivers, and it’s odd because Intel is supposed to be handling the display and that part didn’t change

See attached vid

Don’t get me wrong, “I can live with that”, but again it’s one of those things we’d like to understand so that the behavior is predictable. Who knows what else could suddenly behave weirdly.