While playing a game (Lutris/Wine), computer crashes and displays are disconnected

NVIDIA RTX 3060 Ti, driver 550.67 (latest from the distro).

On Pop!_OS (latest updates installed), while playing games through Lutris (Windows games, using Wine), the displays get disconnected (“no signal” message) and the computer crashes (I know this because of the sound - the last sound played loops non-stop). It started occurring about 2 days ago; before that everything worked fine.

I have found such errors in logs from the time it crashed:

nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c67e:4:0:0x0000000f
NVRM: Xid (PCI:0000:01:00): 79, pid=‘’, name=, GPU has fallen off the bus.
NVRM: GPU at PCI:0000:01:00: GPU-18b3f147-bc78-3f10-613e-057910c70878
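
For anyone who wants to pull the same kind of lines out of their own logs, here is a minimal sketch. It assumes a systemd journal that persists across boots; `extract_xid_codes` is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/bin/sh
# extract_xid_codes: read kernel log text on stdin and print the numeric
# Xid code of every "NVRM: Xid (...): N, ..." line, one code per line.
# (Hypothetical helper name, written for this thread.)
extract_xid_codes() {
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\),.*/\1/p'
}

# Possible use after a crash and reboot (kernel messages of the previous boot):
#   journalctl -k -b -1 | extract_xid_codes | sort | uniq -c
```

If the journal is not persistent, the previous boot will not be available and the lines have to be captured before rebooting.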

Otherwise it works normally in 3D apps (Blender, Plasticity, …).

nvidia-bug-report.log.gz (523.0 KB)

EDIT: I turned on the iGPU and connected a second monitor to it, to see what happens when the crash occurs. The GPU was around 50-55 deg C while playing the game. The computer didn’t crash, just the game - the display connected to the NVIDIA card got disconnected and the GPU had “fallen off the bus”. System temperatures were also around 45 deg C.
Under Windows everything works perfectly, even with way more demanding games. Apps like Blender (3D in general) and GPU rendering work normally, both under Linux and Windows.

UPDATE: I found that if I switch the PowerMizer setting from “Auto” to “Prefer Maximum Performance”, the crashes stop occurring, at least for now. I found it somewhere on the internet; it was referring to some other problem, but I tried it anyway.

You’re getting loads of pcie errors from your nvme device, breaking the bus so the nvidia gpu flies off. Please try disabling aspm by setting kernel parameter pcie_aspm=off
If that doesn’t help, check your nvme connection, check for a bios update.
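
A rough sketch of how the parameter could be added and verified. This assumes a default Pop!_OS install that manages boot entries with `kernelstub`; on GRUB-based distros the parameter goes into the GRUB configuration instead. `has_kernel_param` is a hypothetical helper name.

```shell
#!/bin/sh
# Adding the parameter (assumption: the system boots via kernelstub):
#   sudo kernelstub -a "pcie_aspm=off"
#   sudo reboot

# has_kernel_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word in the kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# After the reboot:
#   has_kernel_param pcie_aspm=off && echo "ASPM is disabled on the command line"
```

Checking `/proc/cmdline` after the reboot is worth the few seconds, because a misspelled parameter name is silently ignored by the kernel.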

It has nothing to do with that NVMe drive error - it has behaved like that forever - and the problem started occurring only three days ago. The BIOS is the latest from Asus (motherboard: Asus ROG Strix Z790-I). I checked the disk connection.

UPDATE: PowerMizer trick worked yesterday for some time. Today back to crashes.

UPDATE: I have removed all nvidia drivers and installed -server version.

  1. the main problem still exists
  2. I’m not getting errors about NVMe anymore; I’m assuming those were caused by the nvidia drivers (it’s basically the only change I have made)
  3. pcie_aspm=off didn’t fix the problem (nor the NVMe errors)

Then you’re down to Xid 79 standard procedures. Monitor temperatures, reseat power connectors/the card in its slot, check/replace PSU. If that doesn’t yield anything, the gpu is on its way out.
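
A minimal sketch for the monitoring part, assuming `nvidia-smi` is on the PATH. The query fields are standard `--query-gpu` properties; the function names and the log path are made up for illustration.

```shell
#!/bin/sh
# log_gpu_state FILE
# Append one CSV sample per second (timestamp, temperature in deg C,
# power draw in W, graphics clock in MHz) to FILE until interrupted.
log_gpu_state() {
    nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.gr \
        --format=csv,noheader,nounits -l 1 >> "$1"
}

# peak_values FILE
# Print the highest temperature, power draw and graphics clock in the log.
peak_values() {
    awk -F', *' '
        $2 + 0 > t { t = $2 + 0 }
        $3 + 0 > p { p = $3 + 0 }
        $4 + 0 > c { c = $4 + 0 }
        END { printf "temp_c=%d power_w=%.1f clock_mhz=%d\n", t, p, c }
    ' "$1"
}

# Possible use: start the logger in a terminal, play until the crash, then:
#   log_gpu_state "$HOME/gpu-state.csv"     # hypothetical path
#   tail -n 5 "$HOME/gpu-state.csv"         # last samples before the crash
#   peak_values "$HOME/gpu-state.csv"
```

The last few samples before the GPU drops off the bus show what temperature, power draw and clock it was at, which helps tell a thermal problem from a power or clock one.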

I would agree with that - if there were any issues under Windows. There are none: no crashes, no problems, the same games run flawlessly.
And I’m using this system mainly for work - big 3D projects and GPU rendering, still without any issues.

Gaming is neither graphics nor cuda, Windows is not Linux, but an Xid 79 is still an Xid 79.

What are you talking about? :D If the card works under Windows without any problems - and NOT under Linux - it’s a software issue. Also, under Linux I was playing rather undemanding titles, like WoT or Fallout 4, with GPU temps around 50-60 deg C. Under Windows it runs Cyberpunk without a glitch or a crash :)

Xid 79 - at least according to the official NVIDIA docs - can be a hardware issue, right, but it might also be a problem with the driver (source: XID Errors :: GPU Deployment and Management Documentation) - and everything points to the second option.

No. Forget about that, it’s an illusion you’re talking yourself into.
It can be a driver issue, but only on notebooks in very, very, very rare cases, mostly model specific. You have a desktop. Xid 79 is hardware, always. You’re not the one in a billion case.

So why does everything work correctly under Windows, even in more taxing workloads? There is some problem in Linux - either the driver, or something with support for other components.
Can it be, for example, a problem with motherboard support under Linux? It’s a specific one, as it’s ITX. One of the NVMe slots (the one that was throwing errors) can be connected either directly to the CPU’s PCIe lanes or to the bridge. In the first case it supports Gen5 (but the GPU is limited to x8 - it’s called bifurcation in the BIOS); in the second it runs at Gen4 and the GPU has the full x16 lanes to itself. I’m using the second option.

Might be. Judging by the number of ACPI errors in the log, the system BIOS isn’t of good quality. Checking for an update is always worth a shot.
If you can’t fix the PCIe errors from the NVMe, you should at least quieten them by setting pci=noaer as a kernel parameter. The messages themselves are also blocking the bus. To check for PSU issues, you can try limiting clocks to avoid boost spikes (the Linux driver clocks more aggressively than the Windows driver), e.g. nvidia-smi -lgc 300,1400

As I mentioned before, those nvme errors stopped when I purged nvidia driver from the system and installed -server version. So there is something going on with the driver. BIOS is the latest one, 2202 from 9 days ago.

nvidia-smi … - how do I check the current values it’s using? I can’t find that option in --help
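
A hedged sketch of one way to read the clocks: `nvidia-smi -q -d CLOCK` prints the current and maximum clocks, and the `--query-gpu` form below gives the same values as CSV. `query_graphics_clock` and `over_limit` are hypothetical helper names, and 1695 MHz is just the boost figure from the card's spec sheet, used here as an example threshold.

```shell
#!/bin/sh
# Human-readable overview of current, application and max clocks:
#   nvidia-smi -q -d CLOCK

# query_graphics_clock
# Print "current, max" graphics clock in MHz, one line per GPU.
query_graphics_clock() {
    nvidia-smi --query-gpu=clocks.gr,clocks.max.gr --format=csv,noheader,nounits
}

# over_limit LIMIT_MHZ
# Read "current, max" lines on stdin and print every current clock
# that is above LIMIT_MHZ.
over_limit() {
    awk -F', *' -v limit="$1" '$1 + 0 > limit + 0 { print $1 }'
}

# Possible use while a game is running:
#   query_graphics_clock | over_limit 1695
# Removing a limit that was set with -lgc:
#   sudo nvidia-smi -rgc
```

Note that the reported maximum is only an upper bound exposed by the driver; the current value under load is the one that matters for comparing against the spec sheet.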

EDIT: PSU is quality 850W SFX - overkill for 3060Ti and i7-13 series, should be good even for 3090.

Wait a second… Your suggestion about nvidia-smi and clocking led me to check something.

I’ve checked what clock speeds are used in NVidia Settings: PowerMizer is showing 270-2160 MHz at level 4 (max). Then I checked specs of the card manufacturer (Gigabyte RTX 3060Ti Eagle): it states that max boost is 1695 MHz…

Is it just me, or is the driver applying WAY too high clocks for my GPU? An overclock from ~1700 to 2160 MHz is huge…

UPDATE: after applying limits (sudo nvidia-smi -lgc 270,1695) no more crashes (for now), and … FPS jumped from 160-190 to ~320-350. There is definitely something wrong with the clock speeds the driver is applying by default…

Starting with Turing, the “max clocks” displayed by nvidia-smi and nvidia-settings are merely theoretical limits, never used and never reached.

Ummm… I don’t fully understand what you mean - the PowerMizer displays actual clock speeds (Graphics clock: …) And according to those my GPU reached this maximum of 2160 MHz when it was crashing - and by the manufacturer specs, it NEVER should.

Now, after applying nvidia-smi -lgc 270,1695 it’s reaching max of 1695 MHz… and no crashes.

How do I make this setting permanent, even after restarts?

You’ll have to create a systemd unit for it.
If it’s crashing when boosting to high clocks, the psu is breaking down on power spikes.
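
A minimal sketch of such a unit, written out by a small shell function. Everything named here is an assumption for illustration: the unit name `nvidia-lock-clocks.service`, the `/usr/bin/nvidia-smi` path, and the 270,1695 range taken from the command used earlier in the thread.

```shell
#!/bin/sh
# write_clock_limit_unit DEST MIN_MHZ MAX_MHZ
# Write a oneshot systemd unit to DEST that locks the graphics clock range
# at boot. The nvidia-smi path is an assumption; check it with
# "command -v nvidia-smi".
write_clock_limit_unit() {
    cat > "$1" <<EOF
[Unit]
Description=Lock NVIDIA GPU graphics clocks

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -lgc $2,$3

[Install]
WantedBy=multi-user.target
EOF
}

# Possible use (hypothetical unit name):
#   write_clock_limit_unit /tmp/nvidia-lock-clocks.service 270 1695
#   sudo install -m 644 /tmp/nvidia-lock-clocks.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now nvidia-lock-clocks.service
```

After a reboot, `systemctl status nvidia-lock-clocks.service` and `nvidia-smi -q -d CLOCK` should show whether the limit was actually applied.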

Sorry, but this is bullshit - I was using this PSU before, and I was overclocking the CPU (reached 5.1 GHz on all cores without any issues) - it can withstand much bigger spikes. And at 850W it’s already overkill for an i7 and a 3060 Ti.

The real problem lies in the clock speeds - why is the driver applying such high clocks to the GPU? I’ve checked it under Windows, and the GPU clock never exceeded 1750 MHz.
This could end up just killing the GPU. We are talking about a HUGE overclock, from ~1700 to ~2150 MHz…

Do what you want, I’m out.