PRIME render offloading not working through amdgpu

I have a dual-GPU system with an RTX 4070 and a Polaris Radeon (a 440, I think?). This is an openSUSE Tumbleweed system where the RTX is set up to be detached from Linux and passed to a Windows 10 VM. All of that works just fine. Each card is connected to a different input on the monitor: the Radeon via HDMI->HDMI and the RTX via DP->HDMI through a converter. I have stubbed out nvidia-drm so it does not load, which means Xorg never grabs the NVIDIA card for a display.
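
For reference, the stub is just a modprobe override along these lines (the filename is my own choice):

# /etc/modprobe.d/99-nvidia-drm-stub.conf
# blacklist only prevents automatic loading; the install line also turns
# an explicit "modprobe nvidia-drm" into a no-op
blacklist nvidia_drm
install nvidia_drm /bin/true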

Any attempt to use render offloading (either Vulkan or GL) throws some sort of error. My understanding is that nvidia-drm is only needed for driving a display, which I am not interested in having the card do on the host (Linux) system. If that is incorrect, I would appreciate the correction. Otherwise, I am very confused as to why the RTX can be attached to the system, can be used for non-offloading tasks, and can be selected for offloading, yet offloading errors out.

Modules loaded:

nvidia_modeset       1605632  0
nvidia_uvm           6610944  0
nvidia              60370944  2 nvidia_uvm,nvidia_modeset
video                  77824  4 asus_wmi,amdgpu,asus_nb_wmi,nvidia_modeset

Confirmation that the RTX is accessible (and that the system can use its CUDA cores):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070        Off |   00000000:08:00.0 Off |                  N/A |
| 30%   44C    P2             27W /  200W |    2137MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     15742      C   python3                                      2132MiB |
+-----------------------------------------------------------------------------------------+

Attempts to offload GLX rendering:

> __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxgears
X Error of failed request:  BadAlloc (insufficient resources for operation)
  Major opcode of failed request:  151 (GLX)
  Minor opcode of failed request:  5 (X_GLXMakeCurrent)
  Serial number of failed request:  0
  Current serial number in output stream:  36

> __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep renderer
X Error of failed request:  BadAlloc (insufficient resources for operation)
  Major opcode of failed request:  151 (GLX)
  Minor opcode of failed request:  5 (X_GLXMakeCurrent)
  Serial number of failed request:  0
  Current serial number in output stream:  31

Attempts to offload Vulkan rendering:

> __NV_PRIME_RENDER_OFFLOAD=1 vkcube
Selected GPU 0: NVIDIA GeForce RTX 4070, type: DiscreteGpu
Segmentation fault (core dumped)

journalctl output (backtrace from the vkcube crash, truncated):

#3  0x00007faadce2b8a1 n/a (libnvidia-glcore.so.550.54.14 + 0xe2b8a1)
#4  0x00007faadc9f5924 n/a (libnvidia-glcore.so.550.54.14 + 0x9f5924)
#5  0x00007faade292bb2 start_thread (libc.so.6 + 0x92bb2)
#6  0x00007faade31400c __clone3 (libc.so.6 + 0x11400c)

Stack trace of thread 13849:
#0  0x00007faade28effe __futex_abstimed_wait_common (libc.so.6 + 0x8effe)
#1  0x00007faade291d40 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x91d40)
#2  0x00007faad8a38b3b n/a (libvulkan_radeon.so + 0x238b3b)
#3  0x00007faad8a485f7 n/a (libvulkan_radeon.so + 0x2485f7)
#4  0x00007faade292bb2 start_thread (libc.so.6 + 0x92bb2)
#5  0x00007faade31400c __clone3 (libc.so.6 + 0x11400c)

Stack trace of thread 13852:
#0  0x00007faade28effe __futex_abstimed_wait_common (libc.so.6 + 0x8effe)
#1  0x00007faade291d40 pthread_cond_wait@@GLIBC_2.3.2 (libc.so.6 + 0x91d40)
#2  0x00007faadc9f340c n/a (libnvidia-glcore.so.550.54.14 + 0x9f340c)
#3  0x00007faadce26255 n/a (libnvidia-glcore.so.550.54.14 + 0xe26255)
#4  0x00007faadc9f5924 n/a (libnvidia-glcore.so.550.54.14 + 0x9f5924)
#5  0x00007faade292bb2 start_thread (libc.so.6 + 0x92bb2)
#6  0x00007faade31400c __clone3 (libc.so.6 + 0x11400c)

Stack trace of thread 13853:
#0  0x00007faade28effe __futex_abstimed_wait_common (libc.so.6 + 0x8effe)
#1  0x00007faade292065 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x92065)
#2  0x00007faadc9f346c n/a (libnvidia-glcore.so.550.54.14 + 0x9f346c)
#3  0x00007faadce1009d n/a (libnvidia-glcore.so.550.54.14 + 0xe1009d)
#4  0x00007faadc9f5924 n/a (libnvidia-glcore.so.550.54.14 + 0x9f5924)
#5  0x00007faade292bb2 start_thread (libc.so.6 + 0x92bb2)
#6  0x00007faade31400c __clone3 (libc.so.6 + 0x11400c)

Stack trace of thread 13851:
#0  0x00007faade28effe __futex_abstimed_wait_common (libc.so.6 + 0x8effe)
#1  0x00007faade292065 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x92065)
#2  0x00007faadc9f346c n/a (libnvidia-glcore.so.550.54.14 + 0x9f346c)
#3  0x00007faadce3c681 n/a (libnvidia-glcore.so.550.54.14 + 0xe3c681)
#4  0x00007faadc9f5924 n/a (libnvidia-glcore.so.550.54.14 + 0x9f5924)
#5  0x00007faade292bb2 start_thread (libc.so.6 + 0x92bb2)
#6  0x00007faade31400c __clone3 (libc.so.6 + 0x11400c)

Stack trace of thread 13854:
#0  0x00007faade28effe __futex_abstimed_wait_common (libc.so.6 + 0x8effe)
#1  0x00007faade292065 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x92065)
#2  0x00007faadc9f346c n/a (libnvidia-glcore.so.550.54.14 + 0x9f346c)
#3  0x00007faadcf076d4 n/a (libnvidia-glcore.so.550.54.14 + 0xf076d4)
#4  0x00007faadcef47b6 n/a (libnvidia-glcore.so.550.54.14 + 0xef47b6)
#5  0x00007faadc9f5924 n/a (libnvidia-glcore.so.550.54.14 + 0x9f5924)
#6  0x00007faade292bb2 start_thread (libc.so.6 + 0x92bb2)
#7  0x00007faade31400c __clone3 (libc.so.6 + 0x11400c)
ELF object binary architecture: AMD x86-64

DRM stands for Direct Rendering Manager; without nvidia-drm, rendering won't work.

Doh, thanks for setting me straight.

I only stubbed out the module because it was impossible to unload even when the RTX was excluded from driving displays, which in turn meant the RTX couldn't be detached for the VM. Any thoughts on how to achieve that without restarting the Linux host DE?
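
Concretely, what I'd like to be able to do while the session stays up is the manual equivalent of virt-manager's detach/reattach, something like the following (the GPU address is the one nvidia-smi reports; detaching the card's .1 audio function as well is an assumption on my part):

> virsh nodedev-detach pci_0000_08_00_0
> virsh nodedev-detach pci_0000_08_00_1

and later, to hand the card back to the host:

> virsh nodedev-reattach pci_0000_08_00_1
> virsh nodedev-reattach pci_0000_08_00_0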

That won't work, for several reasons.

If the module isn't driving a display and has no processes attached to it, why wouldn't you be able to unload it like the other NVIDIA kernel modules? I get that it gets upset when you try; I'm just struggling to understand why.
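
For what it's worth, this is roughly how I've been checking whether anything still has the card open before attempting the unload:

> sudo fuser -v /dev/nvidia*
> lsmod | grep nvidia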

What you want is a hot-unplug situation, which is shitty on Linux: something always keeps a hold on the GPU and won't let go. When it comes to hot unplug, Xorg is bad at it and so is Wayland; the Linux DRM infrastructure is bad at it and so is the NVIDIA driver.

I did a bit more research and I have an update: as you said, nvidia-drm is necessary, so I'm no longer stubbing it. Instead I am turning NVIDIA KMS off and removing the EGL spec file (15_nvidia_gbm.json) from /usr/share/egl/egl_external_platform.d/. This prevents Wayland and GNOME from grabbing the NVIDIA card. It can now be swapped without any issues within the same DM session using virt-manager's automatic detach/attach process.
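
In practical terms the two pieces are a modprobe option and moving the JSON out of the loader's search path (the modprobe filename and the parking directory are my own choices):

# /etc/modprobe.d/09-nvidia-kms.conf
options nvidia-drm modeset=0

> mkdir -p /usr/local/share/egl-disabled
> mv /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json /usr/local/share/egl-disabled/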

I still can't get PRIME render offloading working when the RTX is attached to the host. Same errors as before. However, this time all of the driver modules (nvidia, *-drm, *-uvm, *-modeset) are loaded. Is KMS on the nvidia-drm module a requirement for PRIME offloading?
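
For completeness, the nvidia-drm KMS state can be read back from sysfs (N means it is off):

> cat /sys/module/nvidia_drm/parameters/modeset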

I have seen at least one report of a similar setup working by explicitly specifying the EGL/GLX vendor file. I tried redirecting to the path where I moved the EGL file and no dice. Doing the same with Vulkan is also suggested, but for some reason Tumbleweed doesn't ship an NVIDIA ICD (even on my Optimus laptop with fully functional PRIME offloading, I can't find one anywhere).
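
For reference, the overrides I know of are the standard glvnd and Vulkan-loader environment variables; the sort of invocation I have been trying looks like this (the JSON paths are the stock install locations, and the Vulkan one is where an ICD would live if Tumbleweed shipped it):

> __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json glxinfo | grep renderer
> __NV_PRIME_RENDER_OFFLOAD=1 VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/nvidia_icd.json vkcube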

That's one of the issues. While disabling nvidia-drm KMS is a prerequisite for unloading the NVIDIA GPU, it is at the same time needed for many offloading scenarios. There are also subtle differences depending on whether Intel (i915) or AMD (amdgpu) is the target.
The Vulkan ICD should be in /usr/share/vulkan/icd.d

That's where I'm confused, because there are reports of that setup working. I wonder if it's only Vulkan offloading that works. I haven't been able to test that since I can't find an ICD file to aim it at; I'll keep looking into that.
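
If it comes to it, the ICD file the NVIDIA driver normally installs is tiny, so hand-writing one should be possible; from memory it looks roughly like this (the api_version is a guess for the 550 series):

{
    "file_format_version": "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version": "1.3.277"
    }
}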

I do wonder if there's any way to completely exclude a GPU from being picked up by Wayland. I know there isn't (really) in X11, but the NVIDIA support for Wayland is so new that I honestly have no idea. I would think keeping nvidia-drm KMS loaded while finding some way to keep EGL and Wayland from attaching to the card would work; like I said, I have no idea what that would look like.

I suppose another option would be to use Bumblebee and eat the performance cost, since Bumblebee runs its own display that can be destroyed without impacting the user session.