nvidia-driver-510 with render offload: kernel BUG on Ubuntu 20.04 LTS on Dell Inc. Precision 3530

Hi,
I updated to nvidia-driver-510 and have since been seeing errors in dmesg such as the one below. When I revert to nvidia-driver-470, everything works as expected, with no errors. I also tried installing the latest 510.54 from NVIDIA, but the problem persists.
The Dell Inc. Precision 3530 comes with an NVIDIA P600, which I use as the PRIME render offload source. I can trigger the crash by starting an up-to-date Steam client, which crashes and stays zombified.

[Mar 5 15:47] ------------[ cut here ]------------
[ +0,000002] kernel BUG at drivers/gpu/drm/drm_gem.c:154!
[ +0,000003] invalid opcode: 0000 [#2] SMP PTI
[ +0,000002] CPU: 7 PID: 10196 Comm: steam Tainted: P D OE 5.4.0-100-generic #113-Ubuntu
[ +0,000001] Hardware name: Dell Inc. Precision 3530/0T36NT, BIOS 1.18.0 12/09/2021
[ +0,000010] RIP: 0010:drm_gem_private_object_init+0xa2/0xb0 [drm]
[ +0,000001] Code: 00 31 c0 c1 e9 03 f3 48 ab 48 c7 43 18 00 00 00 00 48 c7 83 c0 00 00 00 00 00 00 00 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb b2 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 89
[ +0,000001] RSP: 0018:ffffa77c0731fcb0 EFLAGS: 00010206
[ +0,000001] RAX: ffff9c5e57647368 RBX: ffff9c5e57647358 RCX: 0000000000000200
[ +0,000000] RDX: 0000000000000200 RSI: ffff9c5e57647200 RDI: ffff9c5e8a729800
[ +0,000001] RBP: ffffa77c0731fcd8 R08: ffff9c5dc8b1e108 R09: ffff9c5dc8b1e108
[ +0,000001] R10: ffff9c5e68110008 R11: 0000000000000001 R12: ffff9c5e57647200
[ +0,000000] R13: 0000000000000200 R14: ffff9c5e8a729800 R15: ffff9c5e8e76b100
[ +0,000001] FS: 0000000000000000(0000) GS:ffff9c5ebc5c0000(0063) knlGS:00000000f7812b40
[ +0,000001] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ +0,000000] CR2: 00000000573a8000 CR3: 0000000763390006 CR4: 00000000003606e0
[ +0,000001] Call Trace:
[ +0,000003] ? nv_drm_gem_object_init+0x54/0x60 [nvidia_drm]
[ +0,000002] __nv_drm_nvkms_gem_obj_init+0xab/0xf0 [nvidia_drm]
[ +0,000002] ? nv_drm_dumb_create+0x1a0/0x1a0 [nvidia_drm]
[ +0,000001] nv_drm_gem_import_nvkms_memory_ioctl+0x89/0x110 [nvidia_drm]
[ +0,000002] ? nv_drm_dumb_create+0x1a0/0x1a0 [nvidia_drm]
[ +0,000005] drm_ioctl_kernel+0xae/0xf0 [drm]
[ +0,000100] ? _nv037891rm+0xac/0x1a0 [nvidia]
[ +0,000009] drm_ioctl+0x24a/0x3f0 [drm]
[ +0,000001] ? nv_drm_dumb_create+0x1a0/0x1a0 [nvidia_drm]
[ +0,000003] ? __check_object_size+0x13f/0x150
[ +0,000084] ? nvidia_ioctl+0x39b/0x8d0 [nvidia]
[ +0,000008] drm_compat_ioctl+0xcb/0xe0 [drm]
[ +0,000002] __ia32_compat_sys_ioctl+0x194/0x220
[ +0,000003] do_fast_syscall_32+0x9d/0x260
[ +0,000002] entry_SYSENTER_compat+0x7f/0x91
[ +0,000001] RIP: 0023:0xf7ef0b49
[ +0,000001] Code: c4 8b 04 24 c3 8b 14 24 c3 8b 1c 24 c3 8b 34 24 c3 8b 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[ +0,000000] RSP: 002b:00000000ff8bf0d8 EFLAGS: 00000286 ORIG_RAX: 0000000000000036
[ +0,000001] RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00000000c0206441
[ +0,000001] RDX: 00000000ff8bf19c RSI: 0000000000000001 RDI: 00000000573817a0
[ +0,000000] RBP: 00000000f6a6c084 R08: 0000000000000000 R09: 0000000000000000
[ +0,000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ +0,000001] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ +0,000001] Modules linked in: nvidia_uvm(OE) rfcomm nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ccm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6>
[ +0,000010] snd_soc_core libarc4 snd_compress ac97_bus snd_pcm_dmaengine rapl intel_cstate serio_raw cdc_ether usbnet wmi_bmof intel_wmi_thunderbolt r8152 dell_wmi_descriptor snd_hda_intel snd_usb_audio mii btusb snd_intel_dspcfg snd_hda_codec btrtl btbcm uv>
[ +0,000015] ghash_clmulni_intel rtsx_pci_sdmmc aesni_intel mxm_wmi crypto_simd nvme i2c_algo_bit cryptd glue_helper e1000e drm_kms_helper syscopyarea sysfillrect sysimgblt nvme_core i2c_i801 fb_sys_fops rtsx_pci thunderbolt drm intel_lpss_pci intel_lpss ahci>
[ +0,000008] ---[ end trace 22248365b6802bd8 ]---
[ +0,019787] RIP: 0010:drm_gem_private_object_init+0xa2/0xb0 [drm]
[ +0,000004] Code: 00 31 c0 c1 e9 03 f3 48 ab 48 c7 43 18 00 00 00 00 48 c7 83 c0 00 00 00 00 00 00 00 5b 41 5c 5d c3 4c 89 a3 f0 00 00 00 eb b2 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 55 48 89
[ +0,000001] RSP: 0018:ffffa77c06397cb0 EFLAGS: 00010206
[ +0,000002] RAX: ffff9c5dd9bfed68 RBX: ffff9c5dd9bfed58 RCX: 0000000000000200
[ +0,000014] RDX: 0000000000000200 RSI: ffff9c5dd9bfec00 RDI: ffff9c5e8a729800
[ +0,000001] RBP: ffffa77c06397cd8 R08: ffff9c5eb3d56688 R09: ffff9c5eb3d56688
[ +0,000000] R10: ffff9c5e68110008 R11: 0000000000000001 R12: ffff9c5dd9bfec00
[ +0,000001] R13: 0000000000000200 R14: ffff9c5e8a729800 R15: ffff9c5e8e76b100
[ +0,000001] FS: 0000000000000000(0000) GS:ffff9c5ebc5c0000(0063) knlGS:00000000f7812b40
[ +0,000001] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ +0,000000] CR2: 00000000573a8000 CR3: 0000000763390006 CR4: 00000000003606e0

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Ah, I found out that nvidia-bug-report.sh was missing, so I reinstalled the driver, including redirecting the glvnd config to /etc/glvnd/egl_vendor.d.
sudo ~/Downloads/NVIDIA-Linux-x86_64-510.54.run --sanity ## now reports SUCCESS
but the kernel BUG still persists.
nvidia-bug-report.log.gz (628.0 KB)

According to the log, you’re not running in offload mode but with nvidia as primary.

Damn, I had no idea how messy this little issue would become! Since installing the 470 drivers, I have been abusing __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only to ensure OpenGL and EGL use the NVIDIA runtime, but that stopped working with the 510 drivers and now causes kernel BUGs.
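For reference, the per-application offload variables mentioned above can be wrapped in a small helper. This is only a sketch: the function name offload_run is made up for illustration, and __GLX_VENDOR_LIBRARY_NAME=nvidia is not in my original command line; it is the variable NVIDIA's README lists for GLX applications, so I am assuming it applies here too.

```shell
# Hypothetical helper: run a single command with NVIDIA's PRIME render
# offload variables set, leaving the rest of the session untouched.
offload_run() {
    env __NV_PRIME_RENDER_OFFLOAD=1 \
        __GLX_VENDOR_LIBRARY_NAME=nvidia \
        __VK_LAYER_NV_optimus=NVIDIA_only \
        "$@"
}

# Assumed usage (glxinfo comes from mesa-utils and may not be installed):
#   offload_run glxinfo | grep "OpenGL renderer"
```

Setting the variables per command rather than exporting them keeps the desktop itself on the Intel GPU.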
I have been trying to configure Xorg to use both cards the way render offload expects.
Until now, I have been using prime-select set to nvidia, or later to on-demand. The whole time I could see both cards in xrandr --listproviders and vulkaninfo.
But when I tried to make the Intel card the primary one, I hit Xorg errors because it could not initialize the Intel card (ending with "modeset(G0): Failed to create pixmap").

I finally succeeded in making the Intel card the primary one after running prime-select intel; Xorg then miraculously started. But it comes at a cost: no nvidia kernel modules are available now, so xrandr and vulkaninfo see only the Intel DRM card. modprobe nvidia fails with could not find module by name='off', which is presumably a result of prime-select. I wish there was more documentation around such tools.

I am disappointed by how difficult getting this technology to work has proved to be (not to mention Ubuntu's gpu-manager, which I disabled to be able to configure Xorg). I guess I am just going to revert to the 470 drivers and keep abusing the render offload environment variables the way I did before.
Can you check my bug report to see if my configuration is wrong?
nvidia-bug-report.log.gz (170.3 KB)

You fiddled too much with pointless config, now nothing really works. To revert to a sane config, please delete
/usr/share/X11/xorg.conf.d/11-nvidia-prime.conf
/usr/share/X11/xorg.conf.d/10-intel.conf
then remove “Option PrimaryGpu” from
/usr/share/X11/xorg.conf.d/10-nvidia.conf
but don’t delete the file.
Finally, remove “nogpumanager” from kernel cmdline and use prime-select to switch to the desired mode.
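A rough sketch of the file edits above, in case it helps. The helper name strip_primary_gpu is hypothetical, and it assumes the option sits on a line of its own in 10-nvidia.conf; the destructive commands are left as comments so nothing is deleted by accident. Removing "nogpumanager" from the kernel cmdline is not covered here.

```shell
# Hypothetical helper: print an Xorg config snippet with every
# "PrimaryGPU" option line dropped (matched case-insensitively).
strip_primary_gpu() {
    grep -iv 'PrimaryGPU' "$1"
}

# Assumed usage on the real system (review before running):
#   sudo rm /usr/share/X11/xorg.conf.d/11-nvidia-prime.conf \
#           /usr/share/X11/xorg.conf.d/10-intel.conf
#   strip_primary_gpu /usr/share/X11/xorg.conf.d/10-nvidia.conf > /tmp/10-nvidia.conf
#   sudo cp /tmp/10-nvidia.conf /usr/share/X11/xorg.conf.d/10-nvidia.conf
#   sudo prime-select on-demand
```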

I almost had it right! At last I have working Intel graphics with offloading to NVIDIA!
I deleted /usr/share/X11/xorg.conf.d/10-nvidia.conf, as it's not useful unless I want NVIDIA to be primary.
I added /etc/modprobe.d/nvidia-modeset.conf:
blacklist nvidia-drm
options nvidia-drm modeset=1
This probably allows i915 to initialize properly; despite being blacklisted, nvidia-drm is still loaded later along with the nvidia module.
I switched back to prime-select on-demand.
I needed to fix the gpu-manager-generated config in /usr/share/X11/xorg.conf.d/11-nvidia-prime.conf to use the proper Xorg driver path /usr/lib/x86_64-linux-gnu/nvidia/xorg (the leading /usr was missing!).
I am certainly keeping gpu-manager disabled; Xorg does dynamic configuration fine without it.
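To check whether the modeset=1 option from the modprobe file above actually took effect, something like the following should do. It is a minimal sketch: the function name is made up, and it assumes the module exposes its parameter at /sys/module/nvidia_drm/parameters/modeset once loaded.

```shell
# Hypothetical helper: report whether nvidia-drm kernel modesetting is on.
# The parameter file is expected to hold "Y" or "N" once the module is loaded.
nvidia_modeset_state() {
    param=${1:-/sys/module/nvidia_drm/parameters/modeset}
    if [ -r "$param" ]; then
        case "$(cat "$param")" in
            Y) echo "enabled" ;;
            *) echo "disabled" ;;
        esac
    else
        echo "nvidia_drm not loaded or parameter not readable"
    fi
}

nvidia_modeset_state
```

The optional argument only exists so the function can be pointed at a different file when trying it out.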

Uh, after upgrading to Ubuntu's latest nvidia-driver package (510.54), my config stopped working again. The modesetting driver was still trying to drive my NVIDIA card (using nouveau). So I had to return to a BusID override, and my PRIME settings now look like this:

Section "OutputClass"
    Identifier "Nvidia Prime"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowEmptyInitialConfiguration"
    Option "IgnoreDisplayDevices" "CRT"
    Option "PrimaryGPU" "No"
    ModulePath "/usr/lib/x86_64-linux-gnu/nvidia/xorg"
EndSection

Section "Device"
	Identifier "nvidia"
	Driver "nvidia"
	BusID "PCI:1:0:0"
	Option "AccelMethod" "none" # no glamor for nvidia
EndSection

Section "Device"
  Identifier "intel"
  Driver "modesetting"
  BusID "PCI:0:2:0"
  Option "kmsdev" "/dev/dri/card0"
#  Option "AccelMethod" "none"
EndSection

This probably duplicates the OutputClass with the Device section, but OutputClass alone would not work for me (or I have it wrong).
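One thing to watch with the BusID lines: Xorg expects decimal PCI:bus:device:function values, while lspci prints the address in hexadecimal. For my cards the two happen to look the same, but on other machines they may not. A small conversion sketch (the function name is hypothetical, and it only handles the short bb:dd.f form without a PCI domain prefix):

```shell
# Hypothetical helper: convert an lspci address such as "01:00.0"
# (hexadecimal bus:device.function) into Xorg's decimal BusID form.
lspci_to_busid() {
    addr=$1
    bus=${addr%%:*}     # text before the first ":"
    rest=${addr#*:}     # text after the first ":"
    dev=${rest%%.*}     # text before the "."
    fn=${rest#*.}       # text after the "."
    printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

lspci_to_busid 01:00.0   # NVIDIA card from the config above
lspci_to_busid 00:02.0   # Intel card from the config above
```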
nvidia-bug-report.log.gz (818.5 KB)
