Driver 545.29.06 crashes seemingly randomly while playing games with Proton/Wine, sometimes also on the login screen, on Pop!_OS

Hey everyone!

One or two months ago I switched to Linux permanently. I’ve been playing games using Steam Proton and (more recently, for testing) Wine, mostly The Division and Cyberpunk 2077. Pretty much every time I play, after a varying amount of time, the NVIDIA X driver crashes and freezes my whole system, forcing me to hard-reboot the computer. Every now and then it sends me back to the “cryptdata setup success” screen and locks up. I’m seeing a variety of errors in the logs, but it usually seems to start with Xid 62. I’ve been monitoring my temperatures: the card usually sits around 50 °C while idling with fans at 0%, going up to around 75 °C under normal gaming load with fans at around 60%.
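
For anyone who wants to pull the Xid events out of their own kernel log instead of reading it by eye, here is a minimal Python sketch. It assumes the lines follow the usual `NVRM: Xid (PCI:...): <code>, ...` shape. The sample lines in it are made up for illustration and are not taken from the attached bug reports. The function names `find_xids` and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape: "NVRM: Xid (PCI:0000:01:00): 62, <free-form detail>"
XID_RE = re.compile(
    r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xids(log_text):
    """Return (bus, code, detail) tuples for every Xid line, in log order."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("bus"), int(match.group("code")), match.group("detail") or "")
            )
    return events


def summarize(events):
    """Count how often each Xid code appears."""
    return Counter(code for _bus, code, _detail in events)


# Invented sample text, only to exercise the parser.
SAMPLE_LOG = """\
[ 100.000000] NVRM: Xid (PCI:0000:01:00): 62, placeholder detail text
[ 100.500000] some unrelated kernel message
[ 101.000000] NVRM: Xid (PCI:0000:01:00): 69, placeholder detail text
"""

if __name__ == "__main__":
    events = find_xids(SAMPLE_LOG)
    if events:
        print("first Xid in log:", events[0][1])
    for code, count in sorted(summarize(events).items()):
        print(f"Xid {code}: {count} occurrence(s)")
```

To use it on a real log, replace `SAMPLE_LOG` with the text of a saved kernel log. The first event in the list is the one to look at, since later Xids are often just fallout from it.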

It hasn’t crashed so far when I’m not playing games. I ran Unigine Heaven on max settings and that ran fine, with no problems.

When it does crash, it produces images like this:

With heavy artifacting (?) on the game screen, and some slight artifacting on other screens.

Sometimes it does not show any artifacts and just freezes up.

I’m worried that my graphics card is failing, but I haven’t had it that long (bought 21st January 2022).

I’m using Pop!_OS 22.04 LTS, with the system76-nvidia-driver version 545.29.06.
My graphics card is the Gigabyte GeForce RTX 3060 Ti GAMING OC 8G 2.0 and I’m using a Gigabyte GP-P750GM 750W PSU.

I’ve been reading the logs, but nothing specific stands out to me, and I’m really at a loss here. I’m hoping someone with a bit more knowledge and experience can take a look and help me out. I’d really appreciate it!

Please let me know if you need any more information, I’d be happy to provide anything else.
nvidia-bug-report.log.gz (450.1 KB)
nvidia-bug-report.log.old.gz (508.3 KB)

To check for HW faults, please run gpu-burn for 10 minutes and check its output.
Regarding CP2077, please check if this helps:
Though this was triggering an Xid 69.

I ran gpu-burn a few times for 10 minutes each: plain, with -d, with -tc, and with both -tc and -d. No errors popped up, thankfully. Temps were hovering between 50 and 66 °C, so that doesn’t seem to be an issue either. I’ll try replacing the vkd3d version to see if that works, and report back.
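
If it helps anyone repeating this, below is a rough Python sketch for sampling temperature, fan speed and load while a stress test runs. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The helper names (`build_command`, `parse_sample`, `read_sample`) are hypothetical, and the example line passed to the parser is invented.

```python
import subprocess

# Fields requested from nvidia-smi, in output order.
QUERY_FIELDS = ("temperature.gpu", "fan.speed", "utilization.gpu")


def build_command():
    """Build the nvidia-smi call that prints one bare CSV line per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]


def parse_sample(line):
    """Turn one CSV line into a dict; non-numeric values become None."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    sample = {}
    for name, raw in zip(QUERY_FIELDS, values):
        try:
            sample[name] = int(raw)
        except ValueError:
            sample[name] = None  # e.g. a field the card does not report
    return sample


def read_sample():
    """Query the first GPU once (requires an NVIDIA driver to be loaded)."""
    result = subprocess.run(build_command(), capture_output=True, text=True, check=True)
    return parse_sample(result.stdout.splitlines()[0])


if __name__ == "__main__":
    # Invented example line, so the script also runs without a GPU.
    print(parse_sample("66, 60, 99"))
```

Calling `read_sample()` every few seconds in a loop and writing the results to a file gives a record to compare against the time of a crash. Note that this only covers the core temperature. It says nothing about how hot the memory gets.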

Unfortunately, replacing vkd3d did not work. It still crashed the system. I’ve also still been having issues with The Division. I used memtest_vulkan (GpuZelenograd/memtest_vulkan on GitHub, a Vulkan compute tool for testing video memory stability) to test my system, and it’s been consistently crashing my system. In the graphical environment it fully locked up the system. From a TTY it reported errors but did not lock up the system; it only partially broke the display, with jumbled text and some lines across the screen once I switched back to the graphical environment. I fear that there’s something wrong with the VRAM on my GPU. I’ve opened a discussion on the GitHub repo and am waiting to hear back. I’ve also created an NVIDIA bug report right after rebooting following the memtest. Perhaps there’s something visible there, but nothing specific stood out to me. Either way, thank you for all your help so far :D.

memtest_vulkan.log (195.0 KB)
nvidia-bug-report.log.gz (554.1 KB)

That’s not looking good. It seems that as soon as the memory heats up, it starts failing.

Hopefully they can help interpret the output.

I’ve confirmed this is an issue on Windows too. The card is luckily still in warranty, so I’m going to open an RMA request and get it replaced or repaired. Thank you so much for all your help, you’re amazing.