Regression 565 PRIME offloading

Hi, been a while.

I’m not really asking for direct help on this: I installed the 550 driver from the .run file, so I can sit on that for a while. This is more about what I can do to help you all fix the issue, so that I can go back to using a driver from the package manager.

Games appear to crash with both DXVK and VKD3D when the NVIDIA GPU is the graphics renderer and an AMD GPU is the primary Xorg renderer. This happens after a seemingly random amount of time in the application: I’ve seen it crash almost immediately, and sometimes it takes 2, 10, 15 minutes or more. The crash manifests as the application freezing; the audio thread carries on for a second before it realizes the rest of the application has locked up, and I have to kill it myself in a terminal or through Steam. When NVIDIA is the primary Xorg renderer, nothing appears to break (although for my setup this is undesirable).
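For anyone trying to reproduce this: render offload to the NVIDIA GPU is normally requested per application through environment variables. Below is a minimal sketch, assuming the variables NVIDIA documents for PRIME render offload. The helper name `offload_env` is made up for illustration, and this is not necessarily how the games above were launched.

```python
import os


def offload_env(base=None):
    """Return a copy of the environment with PRIME render offload requested.

    `base` defaults to the current process environment. The three variables
    are the ones NVIDIA documents for render offload (GLX and Vulkan).
    """
    env = dict(os.environ if base is None else base)
    env["__NV_PRIME_RENDER_OFFLOAD"] = "1"        # enable render offload
    env["__GLX_VENDOR_LIBRARY_NAME"] = "nvidia"   # route GLX to the NVIDIA driver
    env["__VK_LAYER_NV_optimus"] = "NVIDIA_only"  # expose only the NVIDIA GPU to Vulkan
    return env
```

Something like `subprocess.run(["vkcube"], env=offload_env())` would then start a test program on the NVIDIA GPU while Xorg keeps running on the other adapter (`vkcube` is only an example command).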

The modesetting, intel, and amdgpu Xorg drivers appear to make no difference. Disabling my Intel CPU’s turbo and running at base clocks didn’t help. Running the application in gamescope fixes the issue (but is undesirable since it adds a lot of frametime). Switching from CachyOS (Arch derivative) to Fedora 41 didn’t help. Using the latest LTS kernel doesn’t help.

565.77 appears to have this issue, and 550.142 does not. I’m not sure which drivers in between are problematic, but the issue is more recent than not if my memory serves. I figured that after all the troubleshooting I did, if downgrading the NVIDIA driver clearly fixes it, then there’s the smoking gun.
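Narrowing down the first bad release between a known-good and a known-bad driver is a plain bisection. Here is a rough sketch of picking the next version to try. The function names are hypothetical, and the list of candidate versions has to come from whatever your distribution or the .run archive actually offers; nothing here looks that up.

```python
def version_key(version):
    """Turn '560.35.03' into (560, 35, 3) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def next_to_test(versions, good, bad):
    """Return the middle candidate strictly between `good` and `bad`.

    Returns None when no untested version is left in between, which means
    `bad` is the first affected release among the candidates.
    """
    ordered = sorted(set(versions), key=version_key)
    between = [
        v for v in ordered
        if version_key(good) < version_key(v) < version_key(bad)
    ]
    if not between:
        return None
    return between[len(between) // 2]
```

After each test, the tried version replaces either `good` or `bad` and the function is called again until it returns None.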

dmesg, journalctl, dxvk’s log, and Xorg’s log have so far shown nothing relevant to this, but I can still upload any of them if needed (especially if someone can help me enable more verbose logging on any of those).
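On the logging side, DXVK and vkd3d-proton read their log level from environment variables (`DXVK_LOG_LEVEL` and `VKD3D_DEBUG`), and Proton writes a per-game log when `PROTON_LOG=1` is set. Below is a small sketch that formats these as a Steam launch-options string. The helper name and the chosen levels are assumptions, and there is no guarantee the extra output shows anything about this freeze.

```python
def steam_launch_options(env):
    """Format environment overrides as a Steam launch-options string.

    Steam substitutes the game's real command line for %command%.
    """
    pairs = " ".join(f"{key}={value}" for key, value in env.items())
    return f"{pairs} %command%".strip()


# Assumed settings for more verbose logs; adjust the levels as needed.
VERBOSE_LOGGING = {
    "PROTON_LOG": "1",          # have Proton write a per-game log file
    "DXVK_LOG_LEVEL": "debug",  # most verbose DXVK log level
    "VKD3D_DEBUG": "trace",     # most verbose vkd3d log level
}
```

Printing `steam_launch_options(VERBOSE_LOGGING)` gives a single line that can be pasted into the game’s launch options in Steam.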

The games I have tried thus far are HELLDIVERS 2 and Lethal Company, which are Windows applications. HD2 in particular I can run with either DXVK or VKD3D. I’m not sure I have a native, intensive Vulkan application to test with.

Output of inxi -GSC -xx:

System:
  Host: qykopi Kernel: 6.12.7-200.fc41.x86_64 arch: x86_64 bits: 64
    compiler: gcc v: 2.43.1-5.fc41
  Desktop: MATE v: 1.28.2 wm: marco dm: LightDM Distro: Fedora Linux 41
    (MATE-Compiz)
CPU:
  Info: 6-core model: 11th Gen Intel Core i5-11600K bits: 64 type: MT MCP
    arch: Rocket Lake rev: 1 cache: L1: 480 KiB L2: 3 MiB L3: 12 MiB
  Speed (MHz): avg: 800 min/max: 800/5000 cores: 1: 800 2: 800 3: 800 4: 800
    5: 800 6: 800 7: 800 8: 800 9: 800 10: 800 11: 800 12: 800 bogomips: 93888
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 550.142 arch: Ampere
    pcie: speed: 2.5 GT/s lanes: 16 bus-ID: 01:00.0 chip-ID: 10de:24b0
  Device-2: Advanced Micro Devices [AMD/ATI] Tonga PRO [Radeon R9 285/380]
    vendor: Micro-Star MSI driver: amdgpu v: kernel arch: GCN-3 pcie:
    speed: 8 GT/s lanes: 4 ports: active: DP-1,DVI-I-1 empty: DVI-D-1,HDMI-A-1
    bus-ID: 04:00.0 chip-ID: 1002:6939 temp: 62.0 C
  Device-3: MACROSILICON Hagibis
    driver: hid-generic,snd-usb-audio,usbhid,uvcvideo type: USB rev: 3.2
    speed: 5 Gb/s lanes: 1 bus-ID: 2-3.2:3 chip-ID: 345f:2130
  Display: x11 server: X.Org v: 21.1.15 with: Xwayland v: 24.1.4
    compositor: marco v: 1.28.0 driver: X: loaded: amdgpu,nvidia dri: radeonsi
    gpu: amdgpu display-ID: :0.0 screens: 1
  Screen-1: 0 s-res: 4480x1440 s-dpi: 96
  Monitor-1: DVI-I-1 pos: left model: Sun Microsystems GDM-5410
    res: 1920x1440 dpi: 122 diag: 500mm (19.7")
  Monitor-2: DP-1 mapped: DisplayPort-0 pos: primary,right
    model: Dell S2716DG res: 2560x1440 dpi: 109 diag: 686mm (27")
  API: OpenGL v: 4.6 vendor: amd mesa v: 24.3.2 glx-v: 1.4 es-v: 3.2
    direct-render: yes renderer: AMD Radeon R9 380 Series (radeonsi tonga LLVM
    19.1.5 DRM 3.59 6.12.7-200.fc41.x86_64) device-ID: 1002:6939
  API: Vulkan v: 1.3.296 surfaces: xcb,xlib device: 0 type: discrete-gpu
    driver: N/A device-ID: 1002:6939 device: 1 type: discrete-gpu driver: N/A
    device-ID: 10de:24b0 device: 2 type: cpu driver: N/A device-ID: 10005:0000
  API: EGL Message: EGL data requires eglinfo. Check --recommends.

This issue appears to still be present in 570.86.16 Beta. I take it that the following change note is therefore unrelated: “Fixed a bug that could cause external displays to become frozen until the next modeset when using PRIME Display Offloading with the NVIDIA dGPU acting as the display offload sink.” That figures, since it doesn’t quite describe my issue. The display keeps working the whole time; the window that the game is in, however, stops receiving frames, as the game appears to be deadlocked.

I tried both the proprietary kernel modules and the open-source ones; neither makes any difference. I went back and tried a couple of other versions: 565.77 still has the same issue (sanity check), and 560.35.03 fails to build against kernel 6.12.11.

EDIT: Some more context: my setup has changed slightly, and I have some better debugging info for this maybe. With the Intel UHD 750 as the primary Xorg renderer via modesetting, the NVIDIA driver seems to choke on video games much more consistently than it did paired with AMD.

You should run nvidia-bug-report.sh and upload the resulting nvidia-bug-report.log.gz file, otherwise it will be very difficult for anyone to help. It’s probably best to run it right after a freeze, but before you kill the frozen program.
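Alongside the bug report, it can help to record what each thread of the frozen process is doing before it gets killed. Here is a minimal sketch that reads the standard Linux `/proc/<pid>/task` entries. The function names are made up for illustration, and `wchan` may be unavailable or empty depending on the kernel configuration and permissions.

```python
from pathlib import Path


def parse_state(status_text):
    """Extract the 'State:' value (e.g. 'D (disk sleep)') from /proc status text."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return "unknown"


def thread_states(pid):
    """Return (tid, name, state, wchan) for every thread of `pid`."""
    rows = []
    tasks = sorted(Path(f"/proc/{pid}/task").iterdir(), key=lambda p: int(p.name))
    for task in tasks:
        try:
            name = (task / "comm").read_text().strip()
            state = parse_state((task / "status").read_text())
        except OSError:
            continue  # the thread exited while we were reading it
        try:
            # Kernel symbol the thread is sleeping in, if the kernel exposes it.
            wchan = (task / "wchan").read_text().strip() or "-"
        except OSError:
            wchan = "-"
        rows.append((int(task.name), name, state, wchan))
    return rows
```

Threads stuck in state `D` (uninterruptible sleep) are usually blocked inside the kernel, while a set of threads all in `S` that never wake up points more toward a userspace wait.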

I have an Intel iGPU and an Nvidia eGPU, and I have used PRIME a few times to offload Starcraft-2 using wine+DXVK without any issues. My system is Debian trixie, kernel 6.12.x, driver 570.86.15, wine 10~rc2, DXVK 2.5.3.

Well I guess thanks for the suggestion and the “works on my computer.”

nvidia-bug-report.log.gz (716.2 KB)

HELLDIVERS 2 was locked up at the time nvidia-bug-report.sh was run. This instance had been running for less than a minute. In fact, the Steam overlay init notification was still in the bottom right corner.

I discovered that it can unfreeze, but this is incredibly difficult to control and it freezes again quite often. Resizing the window or moving it to another screen, for example, will unfreeze it.

I noticed that nvidia_drm.modeset was not enabled (cat /sys/module/nvidia_drm/parameters/modeset responded with N). Enabling it, so that it now responds with Y, made no difference. I also enabled early KMS loading for amdgpu i915 nvidia nvidia_modeset nvidia_uvm nvidia_drm, but again no change.
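The same check can be scripted, which is handy when comparing several machines or boots. Below is a small sketch, assuming the sysfs path quoted above; the function names are hypothetical.

```python
from pathlib import Path

MODESET_PARAM = Path("/sys/module/nvidia_drm/parameters/modeset")


def parse_modeset(text):
    """Map the sysfs value to a bool: 'Y' means kernel modesetting is enabled."""
    return text.strip() == "Y"


def modeset_enabled(path=MODESET_PARAM):
    """Return True or False, or None if nvidia_drm is not loaded."""
    try:
        return parse_modeset(Path(path).read_text())
    except OSError:
        return None  # module not loaded, or parameter not readable
```

`modeset_enabled()` returning `None` rather than `False` keeps "module not loaded" distinct from "loaded with modesetting off".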

Turning off interlacing on the AMD display adapter made no difference, nor did changing resolution to 800x600.

Log after freezing and unfreezing several times, with the kernel options mentioned above changed:
nvidia-bug-report.log.gz (717.7 KB)

570.124.04 is affected

I think I’m seeing this problem too? This exactly describes the issues I’ve been having.
inxi -GSC -xx:

System:
  Host: berseliusArtix Kernel: 6.13.8-zen1-1-zen arch: x86_64 bits: 64
    compiler: gcc v: 14.2.1
  Desktop: Xfce v: 4.20.1 tk: Gtk v: 3.24.48 wm: xfwm4 dm: SDDM
    Distro: Artix base: Arch Linux
CPU:
  Info: 8-core model: AMD Ryzen 7 4800H with Radeon Graphics bits: 64
    type: MT MCP arch: Zen 2 rev: 1 cache: L1: 512 KiB L2: 4 MiB L3: 8 MiB
  Speed (MHz): avg: 1397 min/max: 1400/2900 boost: enabled cores: 1: 1397
    2: 1397 3: 1397 4: 1397 5: 1397 6: 1397 7: 1397 8: 1397 9: 1397 10: 1397
    11: 1397 12: 1397 13: 1397 14: 1397 15: 1397 16: 1397 bogomips: 92624
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: NVIDIA TU117M [GeForce GTX 1650 Ti Mobile] vendor: Lenovo
    driver: nvidia v: 570.133.07 arch: Turing pcie: speed: 2.5 GT/s lanes: 8
    ports: active: none empty: DP-1,HDMI-A-1,eDP-2 bus-ID: 01:00.0
    chip-ID: 10de:1f95
  Device-2: Advanced Micro Devices [AMD/ATI] Renoir [Radeon Vega Series /
    Radeon Mobile Series] vendor: Lenovo driver: amdgpu v: kernel arch: GCN-5
    pcie: speed: 16 GT/s lanes: 16 ports: active: eDP-1 empty: none
    bus-ID: 05:00.0 chip-ID: 1002:1636 temp: 41.0 C
  Device-3: Syntek Integrated Camera driver: uvcvideo type: USB rev: 2.0
    speed: 480 Mb/s lanes: 1 bus-ID: 1-3:3 chip-ID: 174f:244c
  Display: x11 server: X.org v: 1.21.1.16 with: Xwayland v: 24.1.6
    compositor: xfwm4 v: 4.20.0 driver: X: loaded: modesetting,nvidia
    unloaded: fbdev,nouveau failed: vesa alternate: amdgpu,nv dri: radeonsi
    gpu: amdgpu display-ID: :0.0 screens: 1
  Screen-1: 0 s-res: 1920x1080
  Monitor-1: eDP-1 model: AU Optronics 0xd1ed res: 1920x1080 hz: 120
    dpi: 142 diag: 394mm (15.5")
  API: EGL v: 1.5 platforms: device: 0 drv: nvidia device: 1 drv: radeonsi
    device: 2 drv: nouveau device: 3 drv: swrast gbm: drv: kms_swrast
    surfaceless: drv: nvidia x11: drv: radeonsi inactive: wayland
  API: OpenGL v: 4.6.0 compat-v: 4.5 vendor: amd mesa v: 25.0.3-arch1.1
    glx-v: 1.4 direct-render: yes renderer: AMD Radeon Graphics (radeonsi
    renoir ACO DRM 3.61 6.13.8-zen1-1-zen) device-ID: 1002:1636
  API: Vulkan v: 1.4.309 surfaces: xcb,xlib device: 0 type: discrete-gpu
    driver: N/A device-ID: 10de:1f95 device: 1 type: integrated-gpu driver: N/A
    device-ID: 1002:1636
  Info: Tools: api: eglinfo, glxinfo, vulkaninfo de: xfce4-display-settings
    gpu: nvidia-settings,nvidia-smi x11: xprop,xrandr

Attached bug report is from vkmark -p fifo, which is a native application.
nvidia-bug-report.log.gz (2.1 MB)
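Since a benchmark run normally finishes on its own, a simple timeout is enough to flag a freeze automatically. Here is a rough sketch. The function name and the timeout value are assumptions, and the process is deliberately left running on a hang so that nvidia-bug-report.sh can still be run against it.

```python
import subprocess


def wait_or_flag_hang(cmd, timeout_s):
    """Start `cmd` and wait up to `timeout_s` seconds for it to finish.

    Returns (process, status) where status is 'ok', 'failed', or 'hung'.
    On 'hung' the process is NOT killed, so diagnostics can be collected
    before the caller terminates it.
    """
    proc = subprocess.Popen(cmd)
    try:
        code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return proc, "hung"
    return proc, "ok" if code == 0 else "failed"
```

For example, `wait_or_flag_hang(["vkmark", "-p", "fifo"], 600)` would report `hung` if the run has not finished after ten minutes; the right timeout depends on how long a clean run takes on the machine.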

Also, while Gamescope does stabilize it, I’ve still seen freezes using Gamescope that seem related, just much less frequently.

Seemingly no change with 570.144.

I did, however, launch the Wayland version of my desktop environment on a whim, and for the time I spent in Wayland this bug did not appear to occur. Upon switching back to xorg, the bug was still there. I tried launching a Weston session and running a problematic application inside of it (vkmark again), and that also did not seem to suffer from this problem.

I think it’s possible that the freezes I saw while using Gamescope are a different problem (possibly related to the application I saw them occur in). If I see one again, I’ll try to get a bug report, in case that can show it is unrelated (if it is, it would appear this bug is somehow specific to Xorg), or at least provide a significantly different test case.

If there’s anything I can do to help track this down, please let me know!

575.51.02 not resolved