396.54 crashes big time during logout on Kubuntu 18.04

During logout (this happens roughly one time in three):

  • The system freezes hard.
  • nvidia-bug-report.sh gets stuck
  • XID 44 seems to be the culprit
  • ssh reboot gets stuck while waiting on a frozen nvidia process
  • top reports: 100% CPU for irq/152-nvidia

Since I had to kill nvidia-bug-report.sh because it got stuck, parts of the attached report seem to be missing.

I have attached my kern.log files, where long nvidia-related call traces can be seen (please ignore the USB complaints; they come from a misbehaving SD reader).
nvidia-bug-report.log.gz (76 KB)
kern.tar.gz (393 KB)
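
For anyone going through kern.log files like the ones attached, here is a minimal sketch for tallying which Xid codes show up and how often. It assumes the usual `NVRM: Xid (PCI:...): <number>, ...` kernel message layout. The sample lines and the helper names (`count_xids`, `count_xids_in_file`) are made up for illustration and are not copied from the attached logs.

```python
import re
from collections import Counter

# Matches the Xid number in NVRM kernel messages such as
# "NVRM: Xid (PCI:0000:01:00): 44, ..."
XID_RE = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def count_xids(lines):
    """Return a Counter mapping Xid number -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def count_xids_in_file(path):
    """Same as count_xids, but reads the lines from a log file."""
    with open(path, errors="replace") as fh:
        return count_xids(fh)


# Illustrative lines only (invented for this sketch).
SAMPLE = [
    "Sep  1 10:00:01 host kernel: [  100.000000] NVRM: Xid (PCI:0000:01:00): 44, Ch 00000010",
    "Sep  1 10:00:02 host kernel: [  101.000000] usb 1-2: reset high-speed USB device",
    "Sep  1 10:05:00 host kernel: [  400.000000] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000018",
    "Sep  1 10:06:00 host kernel: [  460.000000] NVRM: Xid (PCI:0000:01:00): 44, Ch 00000010",
]

if __name__ == "__main__":
    for xid, n in sorted(count_xids(SAMPLE).items()):
        print(f"Xid {xid}: {n}")
```

To run it against a real log, call `count_xids_in_file("kern.log")` with the path to an unpacked log file.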

You’re also getting XIDs 8, 31, 32. Have you tried moving the graphics card to a different slot?

Indeed, I have always had those on various occasions. During my investigation I replaced the motherboard and CPU, and nothing changed. I should add that I received this card back last week after it was refurbished (I still don’t know what was fixed), because the XID 79s were unbearable (hence I tried changing the motherboard, had no luck, and sent the card in under warranty). At least the 79s are gone now.

I should correct myself: 32 is new. It happened during a Vulkan-based game (Talos Principle).

OK, I have learned from the service center that I did receive a new board (bundled in an old box; I see new serials), so this is the driver, 100% for sure.

Just for the record, while waiting for the new board I used the integrated Intel graphics for 2 months. I had ZERO problems, none.

I hope someone can work on this issue; otherwise dumping NVIDIA is the only solution. So far it has been 1 year of driver problems. Luckily, new Vega/Navi should be out a year from now.

Could you please delete your xorg.conf? It has two device sections. When you said you replaced the motherboard, did you replace it with the same model?

The change was the following: MSI (Z170) + 6700 -> ASUS (Z370) + 8700K

Here you go: two fresh crashes, with and without xorg.conf.

Interestingly, I tried logging out first and it went OK (I tried just once for now); the second time it crashed. However, there was one peculiarity about the second time:

  1. Log in
  2. Steam
  3. Video game (Arma 3 in this case)
  4. Log out
  5. Crash

It looks as if the driver is left in a bad state after the video game.
nvidia-bug-report.log.gz (77.4 KB)
kern.tar.gz (130 KB)

Can you also reproduce it by switching to a VT and back instead of logging out?
When you replaced the board, did you also replace the memory? If not, test it by pulling all modules but one; if it still crashes, try the next module alone.
Don’t use memtest86 or the like. Those tools are of little use with modern memory and will only report errors if the memory is really, really broken.

OK, I will do these tests and come back. I am not buying the RAM as the cause, though: with only the integrated Intel graphics, the system was truly stable for the first time ever. It must be either the GPU firmware or the driver.

I did notice, however, that the crashes are not 100% reproducible. I tried some more plain login/logout cycles and they were fine; I also repeated the game sequence and the logout survived. I will report back with more findings later.

Going to a TTY first saves the day. This is the sequence:

  1. X
  2. TTY
  3. X (the compositor gets reset here; the usual NVIDIA-related issue, no such thing with Intel)
  4. logout
  5. success (no XID 44)
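
If someone wants to repeat the X -> TTY -> X round trip before logging out without pressing the keys by hand, a rough sketch follows. It assumes the `chvt` tool from the kbd package is installed and that the script runs as root. The VT numbers (text console on VT 2, X on VT 1) are guesses and will differ between setups, and the function names are hypothetical. By default it only prints what it would do.

```python
import subprocess
import time


def vt_round_trip_commands(text_vt=2, x_vt=1):
    """Build the chvt commands for the X -> TTY -> X round trip.

    The VT numbers are assumptions; check which VT your X session
    actually runs on before using this for real.
    """
    return [["chvt", str(text_vt)], ["chvt", str(x_vt)]]


def run(commands, pause=3.0, dry_run=True):
    """Run (or, by default, just print) each command with a pause in between."""
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # chvt normally needs root
            time.sleep(pause)  # give the driver time to settle


if __name__ == "__main__":
    run(vt_round_trip_commands())
```

Pass `dry_run=False` to actually switch VTs, then log out as usual afterwards.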