535.129.03: system freezes, crashes, or misbehaves on RTX 3050

Hi,

No driver newer than 535.98 works on my machine: I get freezes and display glitches.
I’m playing Diablo 4. The game works perfectly with 535.98, but after I update the driver a lot of issues appear.

Please have a look at kernel.log; the stack trace clearly shows the error:

2023-11-05T10:11:14.789878+01:00 morrowind kernel: [  231.298095]  ? __die+0x23/0x70
2023-11-05T10:11:14.789878+01:00 morrowind kernel: [  231.298099]  ? page_fault_oops+0x171/0x4f0
2023-11-05T10:11:14.789878+01:00 morrowind kernel: [  231.298102]  ? _nv013176rm+0xc1/0x130 [nvidia]
2023-11-05T10:11:14.789878+01:00 morrowind kernel: [  231.298328]  ? exc_page_fault+0x7f/0x180
2023-11-05T10:11:14.789878+01:00 morrowind kernel: [  231.298332]  ? asm_exc_page_fault+0x26/0x30
2023-11-05T10:11:14.789879+01:00 morrowind kernel: [  231.298338]  ? _nv043160rm+0x1d/0x40 [nvidia]
2023-11-05T10:11:14.789879+01:00 morrowind kernel: [  231.298505]  _nv016185rm+0xd0/0x120 [nvidia]
2023-11-05T10:11:14.789879+01:00 morrowind kernel: [  231.298667]  _nv045221rm+0x5e9/0x690 [nvidia]
2023-11-05T10:11:14.789879+01:00 morrowind kernel: [  231.298830]  _nv045216rm+0x6c/0x80 [nvidia]
2023-11-05T10:11:14.789880+01:00 morrowind kernel: [  231.299015]  _nv045243rm+0x61/0xb0 [nvidia]
2023-11-05T10:11:14.789880+01:00 morrowind kernel: [  231.299178]  _nv043394rm+0x95/0x100 [nvidia]
2023-11-05T10:11:14.789880+01:00 morrowind kernel: [  231.299317]  _nv000681rm+0x6c/0x80 [nvidia]
2023-11-05T10:11:14.789880+01:00 morrowind kernel: [  231.299469]  rm_cleanup_file_private+0x135/0x200 [nvidia]
2023-11-05T10:11:14.789880+01:00 morrowind kernel: [  231.299619]  nvidia_close+0x157/0x300 [nvidia]
2023-11-05T10:11:14.789880+01:00 morrowind kernel: [  231.299723]  nvidia_frontend_close+0x2b/0x50 [nvidia]
2023-11-05T10:11:14.789881+01:00 morrowind kernel: [  231.299829]  __fput+0xf2/0x2a0
2023-11-05T10:11:14.789881+01:00 morrowind kernel: [  231.299832]  task_work_run+0x5a/0x90

1) Crash/freeze: here is a bug report taken after reboot, plus the kernel logs:
nvidia-bug-report_after_frozen_screen.log.gz (768.3 KB)
kernel.log.tar.gz (63.1 KB)

2) Freeze, but I was able to “recover”; here are the results:
nvidia-bug-report.log.gz (869.4 KB)

Thanks for bringing this issue to our attention. Could you please share reliable steps so that we can reproduce it on our end?

Launch Diablo 4 using Proton Eggroll: here
I tried 8.22 and 8.3 and issues occur. I’m not sure about 8.22, but there might also be an issue with it.

Do you need more information about my system?

A new crash happened:
kernel.log (123.7 KB)

Hi @amrits

Any update on the crash? Do you have any clue about the cause? Is it an NVIDIA driver issue, a Linux kernel issue, or a BIOS issue?

The crash is still happening on my side. It occurs most of the time after resuming from sleep.

Steps:

  • Resume from sleep
  • Launch Diablo 4
  • The game launches; after 1 or 2 minutes the system freezes
  • Do a hard reboot, try again, and it works

Thank you

Hi @amrits,

I got another kind of crash (same configuration):

2023-12-03T13:50:01.356322+01:00 morrowind kernel: [50390.808853] BUG: unable to handle page fault for address: 0000000100000018
2023-12-03T13:50:01.356332+01:00 morrowind kernel: [50390.808858] #PF: supervisor read access in kernel mode
2023-12-03T13:50:01.356332+01:00 morrowind kernel: [50390.808860] #PF: error_code(0x0000) - not-present page
2023-12-03T13:50:01.356333+01:00 morrowind kernel: [50390.808861] PGD 0 P4D 0 
2023-12-03T13:50:01.356333+01:00 morrowind kernel: [50390.808864] Oops: 0000 [#1] PREEMPT SMP NOPTI
2023-12-03T13:50:01.356334+01:00 morrowind kernel: [50390.808866] CPU: 9 PID: 46042 Comm: brave Tainted: P           OE      6.5.0-2-amd64 #1  Debian 6.5.6-1
2023-12-03T13:50:01.356334+01:00 morrowind kernel: [50390.808868] Hardware name: ASUS System Product Name/PRIME B650-PLUS, BIOS 1811 10/07/2023
2023-12-03T13:50:01.356334+01:00 morrowind kernel: [50390.808870] RIP: 0010:_nv042956rm+0x1d/0x40 [nvidia]
2023-12-03T13:50:01.356335+01:00 morrowind kernel: [50390.809057] Code: 00 00 00 44 89 c0 c3 66 0f 1f 44 00 00 66 0f 1f 00 48 8b 47 18 48 85 c0 74 29 48 39 f>
2023-12-03T13:50:01.356335+01:00 morrowind kernel: [50390.809059] RSP: 0018:ffff9fa2471e3aa0 EFLAGS: 00010286
2023-12-03T13:50:01.356336+01:00 morrowind kernel: [50390.809061] RAX: 0000000100000000 RBX: ffff914b7ac85a48 RCX: ffff914cde5c3808
2023-12-03T13:50:01.356336+01:00 morrowind kernel: [50390.809062] RDX: ffffffffffffffd8 RSI: ffff914cd2f4d030 RDI: ffff914cde5c3830
2023-12-03T13:50:01.356336+01:00 morrowind kernel: [50390.809063] RBP: ffff914b7ac859f0 R08: ffffffffffffffd8 R09: ffff914b7ac85940
2023-12-03T13:50:01.356342+01:00 morrowind kernel: [50390.809064] R10: 00000000000380a0 R11: ffff914e4c076008 R12: 0000000000000000
2023-12-03T13:50:01.356342+01:00 morrowind kernel: [50390.809065] R13: ffff914cde5c3830 R14: ffff914b66fd8008 R15: ffff915010195808

RIP: 0010:_nv042956rm+0x1d/0x40 [nvidia]

Hi,

Same crash with driver 535.146.02:

2023-12-17T23:12:18.962014+01:00 morrowind kernel: [  156.232807] RIP: 0010:_nv043176rm+0x1d/0x40 [nvidia]
2023-12-17T23:12:18.962015+01:00 morrowind kernel: [  156.232986] Code: 00 00 00 44 89 c0 c3 66 0f 1f 44 00 00 66 0f 1f 00 48 8b 47 18 48 85 c0 74 29 48 39 f>
2023-12-17T23:12:18.962015+01:00 morrowind kernel: [  156.232987] RSP: 0018:ffffb75488adbb18 EFLAGS: 00010286
2023-12-17T23:12:18.962015+01:00 morrowind kernel: [  156.232989] RAX: 000000001bd83000 RBX: ffff891742bbaa48 RCX: 0000000000000000
2023-12-17T23:12:18.962015+01:00 morrowind kernel: [  156.232990] RDX: ffffffffffffffd8 RSI: ffff891b7ccbd430 RDI: ffff891bbc643030
2023-12-17T23:12:18.962016+01:00 morrowind kernel: [  156.232991] RBP: ffff891742bba9f0 R08: ffffffffffffffd8 R09: ffff891742bba940
2023-12-17T23:12:18.962016+01:00 morrowind kernel: [  156.232992] R10: 00000000000380a0 R11: ffff8915814e2008 R12: 0000000000000000
2023-12-17T23:12:18.962016+01:00 morrowind kernel: [  156.232993] R13: ffff891bbc643030 R14: ffff8915b0900008 R15: ffff891bc6664808
2023-12-17T23:12:18.962016+01:00 morrowind kernel: [  156.232994] FS:  00000000003e2000(0063) GS:ffff891c9dec0000(006b) knlGS:00000000f7f34700
2023-12-17T23:12:18.962016+01:00 morrowind kernel: [  156.232995] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
2023-12-17T23:12:18.962028+01:00 morrowind kernel: [  156.232996] CR2: 000000001bd83018 CR3: 000000073ce22000 CR4: 0000000000750ee0
2023-12-17T23:12:18.962029+01:00 morrowind kernel: [  156.232997] PKRU: 55555554
2023-12-17T23:12:18.962029+01:00 morrowind kernel: [  156.232998] Call Trace:
2023-12-17T23:12:18.962029+01:00 morrowind kernel: [  156.233002]  <TASK>
2023-12-17T23:12:18.962030+01:00 morrowind kernel: [  156.233004]  ? __die+0x23/0x70
2023-12-17T23:12:18.962030+01:00 morrowind kernel: [  156.233008]  ? page_fault_oops+0x171/0x4e0
2023-12-17T23:12:18.962030+01:00 morrowind kernel: [  156.233012]  ? exc_page_fault+0x7f/0x180
2023-12-17T23:12:18.962030+01:00 morrowind kernel: [  156.233015]  ? asm_exc_page_fault+0x26/0x30
2023-12-17T23:12:18.962030+01:00 morrowind kernel: [  156.233021]  ? _nv043176rm+0x1d/0x40 [nvidia]
2023-12-17T23:12:18.962031+01:00 morrowind kernel: [  156.233184]  _nv016187rm+0xd0/0x120 [nvidia]
2023-12-17T23:12:18.962031+01:00 morrowind kernel: [  156.233346]  _nv047230rm+0xa4/0x110 [nvidia]
2023-12-17T23:12:18.962031+01:00 morrowind kernel: [  156.233563]  _nv010782rm+0x51/0x1a0 [nvidia]
2023-12-17T23:12:18.962031+01:00 morrowind kernel: [  156.233771]  _nv018451rm+0x49/0x3d0 [nvidia]
2023-12-17T23:12:18.962032+01:00 morrowind kernel: [  156.233975]  _nv002410rm+0xd/0x20 [nvidia]
2023-12-17T23:12:18.962032+01:00 morrowind kernel: [  156.234149]  _nv004110rm+0x16/0xb0 [nvidia]
2023-12-17T23:12:18.962032+01:00 morrowind kernel: [  156.234314]  _nv016162rm+0x52c/0x690 [nvidia]
2023-12-17T23:12:18.962032+01:00 morrowind kernel: [  156.234490]  _nv043516rm+0xab/0xe0 [nvidia]
2023-12-17T23:12:18.962032+01:00 morrowind kernel: [  156.234623]  _nv045238rm+0xa9/0x130 [nvidia]
2023-12-17T23:12:18.962033+01:00 morrowind kernel: [  156.234787]  _nv045237rm+0x3e5/0x690 [nvidia]
2023-12-17T23:12:18.962033+01:00 morrowind kernel: [  156.234948]  _nv043418rm+0xd5/0x160 [nvidia]
2023-12-17T23:12:18.962033+01:00 morrowind kernel: [  156.235080]  _nv043419rm+0x41/0x70 [nvidia]
2023-12-17T23:12:18.962033+01:00 morrowind kernel: [  156.235210]  _nv000567rm+0x4a/0x60 [nvidia]
2023-12-17T23:12:18.962033+01:00 morrowind kernel: [  156.235341]  _nv000715rm+0x1b7/0xe70 [nvidia]
2023-12-17T23:12:18.962034+01:00 morrowind kernel: [  156.235492]  rm_ioctl+0x58/0xb0 [nvidia]
2023-12-17T23:12:18.962034+01:00 morrowind kernel: [  156.235641]  nvidia_ioctl+0x5d8/0x880 [nvidia]
2023-12-17T23:12:18.962034+01:00 morrowind kernel: [  156.235745]  nvidia_frontend_compat_ioctl+0x3c/0x60 [nvidia]
2023-12-17T23:12:18.962034+01:00 morrowind kernel: [  156.235851]  __do_compat_sys_ioctl+0xc3/0x1a0
2023-12-17T23:12:18.962034+01:00 morrowind kernel: [  156.235855]  __do_fast_syscall_32+0x86/0xe0
2023-12-17T23:12:18.962034+01:00 morrowind kernel: [  156.235858]  ? srso_alias_return_thunk+0x5/0x7f
2023-12-17T23:12:18.962035+01:00 morrowind kernel: [  156.235860]  ? syscall_exit_to_user_mode+0x2b/0x40
2023-12-17T23:12:18.962035+01:00 morrowind kernel: [  156.235862]  ? srso_alias_return_thunk+0x5/0x7f
2023-12-17T23:12:18.962035+01:00 morrowind kernel: [  156.235864]  ? __do_fast_syscall_32+0x95/0xe0
2023-12-17T23:12:18.962035+01:00 morrowind kernel: [  156.235866]  do_fast_syscall_32+0x33/0x80
2023-12-17T23:12:18.962035+01:00 morrowind kernel: [  156.235867]  entry_SYSCALL_compat_after_hwframe+0x6d/0x75
2023-12-17T23:12:18.962035+01:00 morrowind kernel: [  156.235870] RIP: 0023:0xf7f9d579
2023-12-17T23:12:18.962036+01:00 morrowind kernel: [  156.235871] Code: c4 01 10 03 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 0>
2023-12-17T23:12:18.962036+01:00 morrowind kernel: [  156.235873] RSP: 002b:00000000007ff674 EFLAGS: 00000292 ORIG_RAX: 0000000000000036
2023-12-17T23:12:18.962036+01:00 morrowind kernel: [  156.235874] RAX: ffffffffffffffda RBX: 000000000000001e RCX: 00000000c0104629
2023-12-17T23:12:18.962036+01:00 morrowind kernel: [  156.235875] RDX: 00000000007ff750 RSI: 00000000f7e1dff4 RDI: 00000000007ff750
2023-12-17T23:12:18.962037+01:00 morrowind kernel: [  156.235876] RBP: 0000000000000000 R08: 00000000007ff674 R09: 0000000000000000
2023-12-17T23:12:18.962037+01:00 morrowind kernel: [  156.235877] R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
2023-12-17T23:12:18.962037+01:00 morrowind kernel: [  156.235878] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
2023-12-17T23:12:18.962037+01:00 morrowind kernel: [  156.235881]  </TASK>
2023-12-17T23:12:18.962037+01:00 morrowind kernel: [  156.235882] Modules linked in: nvidia_uvm(POE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_connt>
2023-12-17T23:12:18.962038+01:00 morrowind kernel: [  156.235929]  efi_pstore configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_gener>
2023-12-17T23:12:18.962038+01:00 morrowind kernel: [  156.235956] CR2: 000000001bd83018
2023-12-17T23:12:18.962039+01:00 morrowind kernel: [  156.235958] ---[ end trace 0000000000000000 ]---
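
To compare oopses like the ones above, the faulting symbol, RAX, and the fault address can be pulled out of a log excerpt with a short script. This is only a rough sketch, not an NVIDIA tool: the function name `summarize_oops` and the trimmed `SAMPLE` excerpt are my own, and it only looks at the first oops in the text. In this trace CR2 (000000001bd83018) is RAX + 0x18, and in the 2023-12-03 crash the fault address (0000000100000018) is also RAX + 0x18, so both look like the same read through a bad pointer.

```python
import re

# Patterns for the kernel-mode RIP line, the first RAX value, and the fault address.
RIP_RE = re.compile(r"RIP: 0010:(\S+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")
RAX_RE = re.compile(r"\bRAX: ([0-9a-f]{16})")
FAULT_RE = re.compile(r"(?:CR2: |page fault for address: )([0-9a-f]{16})")


def summarize_oops(log_text):
    """Summarize the first oops found in log_text, or return None if incomplete."""
    rip = RIP_RE.search(log_text)
    rax = RAX_RE.search(log_text)
    fault = FAULT_RE.search(log_text)
    if not (rip and rax and fault):
        return None
    rax_val = int(rax.group(1), 16)
    fault_val = int(fault.group(1), 16)
    return {
        "symbol": rip.group(1),
        "module": rip.group(2),
        "rax": rax_val,
        "fault_address": fault_val,
        # Distance between the faulting address and RAX.
        "offset_from_rax": fault_val - rax_val,
    }


# Three lines trimmed from the 535.146.02 trace in this post.
SAMPLE = """\
kernel: [  156.232807] RIP: 0010:_nv043176rm+0x1d/0x40 [nvidia]
kernel: [  156.232989] RAX: 000000001bd83000 RBX: ffff891742bbaa48 RCX: 0000000000000000
kernel: [  156.232996] CR2: 000000001bd83018 CR3: 000000073ce22000 CR4: 0000000000750ee0
"""

if __name__ == "__main__":
    summary = summarize_oops(SAMPLE)
    print(summary["symbol"], summary["module"], hex(summary["offset_from_rax"]))
```

The obfuscated `_nvXXXXXXrm` symbol numbers change between driver builds, so the offset within the function (+0x1d/0x40) and the RAX-relative fault address are probably more useful for matching crashes than the symbol name itself.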

It seems related to VRAM usage. When the game’s texture quality is set to “Ultra”, the crash occurs. With “Low”, the issue does not seem to occur.

Hi @poupouille
Unfortunately, I am not able to reproduce the issue locally after trying the steps in your earlier comments.
I will spend a few more cycles on a few other systems and update.
I have also filed bug 4464466 internally for tracking purposes.

Hi @poupouille
Just wanted to know if you have any other steps that reproduce the same issue,
because I am still not able to reproduce it with the steps you shared earlier.

Hi @amrits ,

I worked around the issue in Diablo 4 by playing at lower resolutions, which use less VRAM. With textures at the maximum setting, the crash occurred. I did not retest.
Since I gave you the call stack, you should be able to find the issue ;)
Or maybe the issue is in my setup, but no other game crashes the system like that.

Hi @poupouille
I spent multiple hours trying to reproduce the issue with driver 550.54.14 and with the reported driver 535.129.03 on the setup below, but still no luck.
Dell Alienware Aurora R15 AMD + AMD Ryzen 9 7900X 12-Core Processor + Ubuntu 22.04.2 LTS + kernel 5.19.0-32-generic + NVIDIA GeForce RTX 3080 + Driver 535.129.03 + DELL G3223D Display 2560x1440 with refresh rate 60Hz

Steps tried:

  1. Logged in to the system and launched the Steam game Diablo IV with textures set to “Ultra”.
  2. Kept it running for an hour or so, ran the Unigine benchmark to increase VRAM usage, and later closed all apps.
    root@oemqa-Alienware-Aurora-R15-AMD:~# nvidia-smi
    Mon May 6 12:20:28 2024
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0  On |                  N/A |
    | 48%   69C    P0            211W /  320W |    9865MiB /  10240MiB |    100%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+

    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |    0   N/A  N/A      1053      G   /usr/lib/xorg/Xorg                            382MiB |
    |    0   N/A  N/A      1536      G   /usr/bin/gnome-shell                           76MiB |
    |    0   N/A  N/A      9034      G   ./heaven_x64                                  433MiB |
    |    0   N/A  N/A      9091      G   ./heaven_x64                                  423MiB |
    |    0   N/A  N/A      9154      G   ./heaven_x64                                  429MiB |
    |    0   N/A  N/A      9232      G   ./heaven_x64                                  372MiB |
    |    0   N/A  N/A     10673      G   ./steamwebhelper                                5MiB |
    |    0   N/A  N/A     12118    C+G   …apps\common\Diablo IV\Diablo IV.exe         7685MiB |
    +-----------------------------------------------------------------------------------------+
  3. Then suspended the system for some time.
  4. Upon resume, restarted the apps, kept them running for some time again, and then rebooted.
  5. Repeated the above steps a couple of times but did not observe a system freeze.

Could you please try once with the latest released driver, 550.78, and share the test results?
https://us.download.nvidia.com/XFree86/Linux-x86_64/550.78/NVIDIA-Linux-x86_64-550.78.run

Hi @poupouille
Did you get a chance to test with the latest released driver, 550.78?

@amrits
Hello, I have the same issue on a Dell G16 with an RTX 4060 (driver version 550.78) and 16 GB of memory. I reported it here, but got no official response. I recently found a way to reproduce the issue 100% of the time.

  1. Enable swap space with enough capacity.
  2. Use memtester to consume nearly all of the available memory.
  3. Run multiple applications that use the NVIDIA GPU for rendering.
  4. Wait; the NVIDIA driver will eventually crash.

If you unplug your charger and keep the battery at a low level (below 30%), the issue reproduces very quickly.

Hello @amrits

Sorry for the late answer.
I’m now running the 560.28.03 driver, and I have not observed any system crash with it.
I also get better performance with the latest Diablo IV game patch.

Regarding your test, you did not fill the VRAM completely: "9865MiB / 10240MiB".

Since you mention VRAM usage: I observed in Diablo IV and other applications that when VRAM is getting full, the following issues occur:

  • KWin (the KDE window manager) might crash
  • Slowness in Diablo IV: it becomes slow after a while, and I have to switch from High => Low => High to get the best performance again. I noticed that afterwards VRAM has around 1 GB free. When the slowness happens again, the free VRAM is near 0.
  • Out-of-memory errors in CUDA applications

I monitored VRAM usage with :

nvidia-smi --query-gpu=pstate,utilization.gpu,memory.free,memory.used --format=csv -l 5
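
To spot the samples where free VRAM runs low, the CSV output of that command can be parsed with a short script. This is only a sketch under some assumptions: `parse_rows` and `low_vram` are names I made up, the 1024 MiB threshold is arbitrary, the sample rows are illustrative rather than captured from my machine, and it assumes the default CSV layout where values carry units (for example `375 MiB`).

```python
import csv
import io

# Arbitrary illustrative threshold, not a value recommended by NVIDIA.
LOW_VRAM_MIB = 1024


def parse_rows(csv_text):
    """Parse `nvidia-smi --format=csv` text for the four queried fields."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader, None)
    rows = []
    if header is None:
        return rows
    for fields in reader:
        if len(fields) != len(header):
            continue  # skip blank or partial lines
        pstate, util, free, used = (f.strip() for f in fields)
        try:
            rows.append({
                "pstate": pstate,
                "gpu_util_percent": int(util.split()[0]),
                "free_mib": int(free.split()[0]),
                "used_mib": int(used.split()[0]),
            })
        except (ValueError, IndexError):
            continue  # skip rows with non-numeric values such as a repeated header
    return rows


def low_vram(rows, threshold_mib=LOW_VRAM_MIB):
    """Return the samples whose free VRAM is below threshold_mib."""
    return [r for r in rows if r["free_mib"] < threshold_mib]


# Made-up sample in the layout the query above prints.
SAMPLE = """\
pstate, utilization.gpu [%], memory.free [MiB], memory.used [MiB]
P0, 100 %, 375 MiB, 9865 MiB
P2, 45 %, 4096 MiB, 6144 MiB
"""

if __name__ == "__main__":
    for row in low_vram(parse_rows(SAMPLE)):
        print("low VRAM:", row)
```

Redirecting the command's output to a file and feeding that file's contents to `parse_rows` should be enough to see whether the slowdowns line up with free VRAM approaching 0.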

Related topic: Non-existent shared VRAM on NVIDIA Linux drivers - #33 by fvalasiad

Could you comment on this point? Around the web, a lot of people claim that nvidia_uvm is not working.
That would mean there is no shared memory between GPU and CPU to offload VRAM when it is getting full.

Maybe the crash I observed initially was related to this?

Thank you.

(I will monitor this thread so that I can answer you quickly.)