Severe stability issue with nvidia 367.18 driver

Card is:
01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 630] (rev a1)

My system crashes a few minutes after logging in to GNOME. This happens on its own, but I can make it crash sooner.

To force a crash, either:

  1. Play a YouTube video (HTML5) and the system will lock up at the end of the video.
  2. Press Alt+F2 and enter r in gnome-shell a few times and the system will crash even when no other applications are running.

The only way to recover is to use Alt+SysRq and force a reboot (this also remounts my root disk as read-only). The bad part is that the journal is not flushed to disk, so getting logs is difficult. I do recall an XID error on the screen.

Downgrading to 364.19 fully solves the stability issues.
This is a Fermi card by the way.

nvidia-bug-report.log.gz (52.5 KB)

HussamT, Please attach nvidia bug report

It won’t let me attach the file
Here is a copy https://www.dropbox.com/s/2f241owa5d6ry8f/nvidia-bug-report-20160522.log.gz?dl=0
Please let me know if it works.

Looks like you are using the testing repos on Arch. I use Arch with the stable repos and also use GNOME with the newest Beta driver and a Fermi card (GTX 580) and could not reproduce your issues. Could you try switching to the normal repos and do a pacman -Suu?

The XID error was 32. According to https://docs.nvidia.com/deploy/xid-errors/, it means:

32 Invalid or corrupted push buffer stream.
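For anyone who wants to pull the XID code out of a kernel log automatically, here is a minimal Python sketch. The log line shape (`NVRM: Xid (<pci address>): <code>, ...`) and the sample line are assumptions for illustration and are not taken from the logs in this thread; only the description for code 32 is filled in, as quoted above.

```python
import re

# Only the code quoted in this thread; the description comes from the
# NVIDIA XID documentation linked above.
XID_DESCRIPTIONS = {
    32: "Invalid or corrupted push buffer stream",
}

# Assumed log shape: "NVRM: Xid (<pci address>): <code>, <details>"
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def parse_xid(line):
    """Return (pci_address, code, description), or None if no Xid is present."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, XID_DESCRIPTIONS.get(code, "unknown")


# Hypothetical sample line, not copied from the reporter's system.
sample = "kernel: NVRM: Xid (PCI:0000:01:00): 32, Channel 00000008"
print(parse_xid(sample))
```

Feeding it the output of `dmesg` or `journalctl -k` line by line should be enough to spot which XID codes were logged before a lockup, provided the log made it to disk.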

My system is compiled from source using ArchLinux PKGBUILDs (I have been doing so since 2006). I guess you mean downgrade to kernel 4.5? I need kernel 4.6 for other fixes.

The 364.19 PKGBUILD was edited and bumped to 367.18, and I verified the installation (I know my way around these things; I have over 20 years of Linux experience).

Another very easy way to reproduce this: open a 3D game, then press Alt+F2 and enter r.
Xorg locks up with an XID 32 error.

Xorg crash data:
Timestamp: Sun 2016-05-22 13:54:41 EEST (3min 0s ago)
Command Line: /usr/lib/xorg-server/Xorg vt1 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose
Executable: /usr/lib/xorg-server/Xorg
Control Group: /user.slice/user-1000.slice/session-c1.scope
Unit: session-c1.scope
Slice: user-1000.slice
Session: c1
Owner UID: 1000 (hussam)
Boot ID: 89fd4494251a464da881b0fa9360478e
Machine ID: efb490e643e2436d9d1138df1745a008
Hostname: hades
Coredump: /var/lib/systemd/coredump/core.Xorg.1000.89fd4494251a464da881b0fa9360478e.476.1463914481000000000000.lz4
Message: Process 476 (Xorg) of user 1000 dumped core.

            Stack trace of thread 476:
            #0  0x00007ff7710ce275 raise (libc.so.6)
            #1  0x00007ff7710cf68a abort (libc.so.6)
            #2  0x000000000059658e OsAbort (Xorg)
            #3  0x000000000059cff7 FatalError (Xorg)
            #4  0x0000000000593e3e n/a (Xorg)
            #5  0x00007ff7710ce2f0 __restore_rt (libc.so.6)
            #6  0x00007ff76b5db2e3 n/a (nvidia_drv.so)
            #7  0x00007ff76baa907a n/a (nvidia_drv.so)
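
The trace above can be read mechanically: `__restore_rt` is the signal return trampoline, so the frame listed right after it is the code that was running when the fatal signal arrived. A small Python sketch of that reading follows; the regular expression assumes the frame layout shown in this coredump summary and may need adjusting for other output.

```python
import re

# Frames copied from the coredump summary above.
TRACE = """\
#0  0x00007ff7710ce275 raise (libc.so.6)
#1  0x00007ff7710cf68a abort (libc.so.6)
#2  0x000000000059658e OsAbort (Xorg)
#3  0x000000000059cff7 FatalError (Xorg)
#4  0x0000000000593e3e n/a (Xorg)
#5  0x00007ff7710ce2f0 __restore_rt (libc.so.6)
#6  0x00007ff76b5db2e3 n/a (nvidia_drv.so)
#7  0x00007ff76baa907a n/a (nvidia_drv.so)
"""

# Assumed layout: "#<n>  0x<address> <symbol> (<module>)"
FRAME = re.compile(r"#(\d+)\s+0x[0-9a-f]+\s+(\S+)\s+\((\S+)\)")


def parse_frames(trace):
    """Return a list of (frame_number, symbol, module) tuples."""
    return [(int(n), sym, mod) for n, sym, mod in FRAME.findall(trace)]


def faulting_frame(trace):
    """Return the first frame below the signal trampoline, or None."""
    frames = parse_frames(trace)
    for index, (_, symbol, _) in enumerate(frames):
        if symbol == "__restore_rt" and index + 1 < len(frames):
            return frames[index + 1]
    return None


print(faulting_frame(TRACE))
```

Here the frame below the trampoline belongs to nvidia_drv.so, which is consistent with the fault being raised inside the NVIDIA Xorg driver, although without symbols nothing more specific can be said.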

(gdb) bt full
#0 0x00007ff7710ce275 in raise () from /usr/lib/libc.so.6
No symbol table info available.
#1 0x00007ff7710cf68a in abort () from /usr/lib/libc.so.6
No symbol table info available.
#2 0x000000000059658e in OsAbort ()
No symbol table info available.
#3 0x000000000059cff7 in FatalError ()
No symbol table info available.
#4 0x0000000000593e3e in ?? ()
No symbol table info available.
#5 <signal handler called>
No symbol table info available.
#6 0x00007ff76b5db2e3 in ?? () from /usr/lib/xorg/modules/drivers/nvidia_drv.so
No symbol table info available.
#7 0x00007ff76baa907a in ?? () from /usr/lib/xorg/modules/drivers/nvidia_drv.so
No symbol table info available.
#8 0x00007ffd6e17b9a8 in ?? ()
No symbol table info available.
#9 0x01007ff772fd1ce1 in ?? ()
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

I only saw that your kernel is newer than the one in [core]. Do you use the PKGBUILDs from [core], [extra], etc. for the rest of the system?

The ones in /trunk/ so should be the latest. ‘Testing’ in Arch Linux is not a whole distribution but just short lived branches for a limited number of packages.
Anyway, reverting to 364.19 made things stable again (even on kernel 4.6) so I don’t think the rest of the system is the cause.

Anyway, can you test something for me please? Run two instances of an OpenGL game (or two different windowed OpenGL games), a YouTube video, and Alt+F2 > r to re-exec gnome-shell.
367.18 seems unable to handle this particular scenario, while 364.19 handles it fine.

Just did that with neverball, neverputt and chromium/youtube open in windows so that they are all visible. Tried it 2 times and it worked just fine.

Ok, I use Firefox but both use ffmpeg/libvpx so that’s not the cause.
What toolkit do neverball and neverputt use? The games I was trying use SDL 1.2 for graphics (to abstract OpenGL usage on Windows, Linux, OS X, etc.).

Thanks for the reply. So we narrowed it down to SDL1 and gnome-shell.

Tried the same with Alien Arena, which is SDL 1.2 and everything still works fine.

As I stated at the beginning, games only force-reproduce the crash. It was crashing on its own a few minutes later anyway, without Firefox or games.

Blackout24, I feel you are trying to prove my crash is not valid and that’s wrong. Are you a NVIDIA developer? If not, I am not interested in your input.
I’m not looking for “community” input. It doesn’t help me. I’m here to report a bug.
Unless you are a NVIDIA developer, I have no interest in your input.
I really wish NVIDIA had a bug tracker so I could communicate directly with developers.

All you are accomplishing here, blackout24, is making it harder for real NVIDIA developers to read my thread.

We also run different cards. Yours is a real Fermi card. Mine is one of those weird in-between technologies but is also a Fermi card. Some 630GT cards are not Fermi. Your card works with nouveau for example. Mine doesn’t.
For example, I never experienced the crash described here https://bugs.archlinux.org/task/48772. I also never had gnome-shell leaks.

Anyway, I will leave it to NVIDIA developers to fix this. But I am sure it is related to the Xorg driver and to running a few applications that do the same thing at the same time.

The crash happens also while switching TTY and back to TTY1.

HussamT, We are tracking this issue under bug 200202593

Thank you very much Sandip.

The bug report file you have provided was generated before the issue happened. Once the system appears to have crashed, can you still SSH into it and type commands? The answer is likely to be yes. In that case, can you please reproduce the issue and then generate a bug report file?
Thanks

The bug report file is before the issue happened.
I will upgrade to the beta driver again and attempt to generate a bug report file using ssh after it crashes. (luckily I have an android device that has ssh installed)
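
For anyone following the same route, this is a rough Python sketch of collecting the report over SSH from a second machine. The host and user names are placeholders, and running the script through `sudo` with `ssh -t` is an assumption about how the remote account is set up; adjust both to your own system.

```python
import shlex
import subprocess


def build_report_command(host, user, script="nvidia-bug-report.sh"):
    """Build the ssh argv that runs the bug report script on the remote host."""
    # "sudo" is an assumption: the script generally needs root to collect logs.
    # "-t" asks ssh for a terminal so sudo can prompt for a password.
    return ["ssh", "-t", f"{user}@{host}", "sudo", script]


def collect_report(host, user, timeout=300):
    """Run the script remotely and return the ssh exit status."""
    argv = build_report_command(host, user)
    print("running:", shlex.join(argv))
    return subprocess.run(argv, timeout=timeout).returncode


# Placeholder names for illustration only; nothing is executed here.
print(shlex.join(build_report_command("crashed-host", "someuser")))
```

Calling `collect_report(...)` right after the display locks up, while the machine still answers on the network, should leave an nvidia-bug-report.log.gz in the remote working directory that can then be copied off with `scp`.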

The following files were generated after the crash by running nvidia-bug-report.sh via ssh.

Please let me know if they download correctly. I can also email them if you wish.
Thank you very much.

What you are describing sounds related to this bug:

https://bugs.archlinux.org/task/48772