Freeze with v370.28

I experienced a system freeze on a GTX 970 with driver v370.28 this morning.

I had been running a Second Life viewer for some time when the whole system froze (display frozen, keyboard unresponsive, etc.) for 10 seconds or so before things returned almost to normal, but with the card down-clocked to 539MHz. The following messages got dumped into /var/log/messages:

Oct 21 11:23:48 localhost klogd: NVRM: GPU at PCI:0000:01:00: GPU-9cf0476e-4dbe-8c0f-4352-800be7075c41
Oct 21 11:23:48 localhost klogd: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000010
Oct 21 11:23:51 localhost klogd: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Oct 21 11:23:53 localhost klogd: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Oct 21 11:23:53 localhost klogd: NVRM: Xid (PCI:0000:01:00): 50,  L2 -> L1

And when I logged off from Second Life later on, I got:

Oct 21 11:30:25 localhost klogd: WARNING: CPU: 3 PID: 12049 at lib/vsprintf.c:1900 format_decode+0x3ac/0x3d0
Oct 21 11:30:25 localhost klogd: Please remove unsupported %{ in format string
Oct 21 11:30:25 localhost klogd: Modules linked in: nvidia_modeset(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia(PO) nvidia_drm(PO)
Oct 21 11:30:25 localhost klogd: CPU: 3 PID: 12049 Comm: cat Tainted: P           O    4.8.3 #1
Oct 21 11:30:25 localhost klogd: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Extreme4 Gen3, BIOS P2.30 06/29/2012
Oct 21 11:30:25 localhost klogd:  0000000000000286 0000000000000000 ffffffff804bd068 0000000000000007
Oct 21 11:30:25 localhost klogd:  ffff8806e5c5fc58 0000000000000000 ffffffff8028df7a ffff8806d616e4e0
Oct 21 11:30:25 localhost klogd:  ffff8806e5c5fd00 ffffffff80a0006a ffff8807f7b88000 ffff8806e5c5fd60
Oct 21 11:30:25 localhost klogd: Call Trace:
Oct 21 11:30:25 localhost klogd:  [<ffffffff804bd068>] ? dump_stack+0x47/0x5f
Oct 21 11:30:25 localhost klogd:  [<ffffffff8028df7a>] ? __warn+0xea/0x110
Oct 21 11:30:25 localhost klogd:  [<ffffffff8028e058>] ? warn_slowpath_fmt+0x48/0x50
Oct 21 11:30:25 localhost klogd:  [<ffffffff80305384>] ? get_page_from_freelist+0x234/0x7b0
Oct 21 11:30:25 localhost klogd:  [<ffffffff804c604c>] ? format_decode+0x3ac/0x3d0
Oct 21 11:30:25 localhost klogd:  [<ffffffff804c8075>] ? vsnprintf+0x65/0x560
Oct 21 11:30:25 localhost klogd:  [<ffffffff803722cb>] ? seq_vprintf+0x2b/0x50
Oct 21 11:30:25 localhost klogd:  [<ffffffff8037232e>] ? seq_printf+0x3e/0x50
Oct 21 11:30:25 localhost klogd:  [<ffffffff803ad548>] ? version_proc_show+0x38/0x40
Oct 21 11:30:25 localhost klogd:  [<ffffffff8037262f>] ? seq_read+0x12f/0x3b0
Oct 21 11:30:25 localhost klogd:  [<ffffffff80332eb1>] ? anon_vma_prepare+0x31/0x180
Oct 21 11:30:25 localhost klogd:  [<ffffffff803a511d>] ? proc_reg_read+0x3d/0x70
Oct 21 11:30:25 localhost klogd:  [<ffffffff80350cfe>] ? __vfs_read+0x1e/0x110
Oct 21 11:30:25 localhost klogd:  [<ffffffff8031958c>] ? vm_mmap_pgoff+0xbc/0xe0
Oct 21 11:30:25 localhost klogd:  [<ffffffff803523c2>] ? vfs_read+0xa2/0x130
Oct 21 11:30:25 localhost klogd:  [<ffffffff8035249b>] ? SyS_read+0x4b/0xc0
Oct 21 11:30:25 localhost klogd:  [<ffffffff8084771b>] ? entry_SYSCALL_64_fastpath+0x13/0x8f
Oct 21 11:30:25 localhost klogd: ---[ end trace cb66c20254363bd2 ]---

I updated yesterday from Linux kernel (vanilla) v4.8.2 to v4.8.3, so this may perhaps be the reason for such a weird bug, which I never encountered during the past month of running driver v370.28.

Also, I noticed that after the freeze, the Mate “command” applet, which runs a personal “gpustat” script (using nvidia-smi) every 15 seconds to display the GPU’s temperature, fan speed, etc. in the Mate panel, was not reporting anything any more (i.e. nvidia-smi was no longer working properly).
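For readers who want a similar panel readout, here is a minimal sketch of what such a script could look like. It is not the actual “gpustat” script, which is not shown in this thread; the function names, the chosen query fields and the 5-second timeout are all assumptions. The timeout is there so that the applet shows a placeholder instead of hanging when nvidia-smi stops responding, as it did after the freeze.

```python
import subprocess

# Fields requested from nvidia-smi (assumed selection; adjust to taste).
QUERY = "temperature.gpu,fan.speed,clocks.sm"


def parse_row(row):
    """Turn one 'temp, fan, clock' CSV row into a short panel label."""
    fields = [field.strip() for field in row.split(",")]
    if len(fields) != 3:
        return "GPU: unexpected output"
    temp, fan, clock = fields
    return f"{temp}C {fan}% {clock}MHz"


def gpu_status(timeout=5.0):
    """Return a one-line status string, or a placeholder if nvidia-smi fails."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError) as exc:
        # Covers a missing binary, a non-zero exit status and a timeout.
        return f"GPU: n/a ({type(exc).__name__})"
    rows = [r for r in result.stdout.splitlines() if r.strip()]
    return " | ".join(parse_row(r) for r in rows)
```

Printing `gpu_status()` from the applet command would then give one short line per GPU, or a "n/a" marker naming the failure type.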

I’m also attaching the traditional nvidia-bug-report.log.gz
nvidia-bug-report.log.gz (71.8 KB)

Xid 8 means the GPU stopped responding. Does this happen frequently? If it just happened once, it was probably just a fluke: a power supply glitch, temperature-related instability, a stray cosmic ray, etc.
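One way to check how often this happens is to count the Xid lines in the system log. The following is only a sketch: it assumes the log lines have the same shape as the ones quoted in the first post, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... NVRM: Xid (PCI:0000:01:00): 8, Channel 00000010"
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xids(lines):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


def count_xids_in_file(path="/var/log/messages"):
    """Count Xid events in a log file (the default path is an assumption)."""
    with open(path, errors="replace") as handle:
        return count_xids(handle)
```

Calling `count_xids_in_file()` on a log that covers several weeks should show whether Xid 8 is a one-off or a recurring event.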

The “%{” format string thing is weird. It doesn’t appear in any format string in the driver that I can find. I wonder if something corrupted the contents of your RAM and that caused both symptoms.

It was the first time it occurred.

My system is rock-stable (never any crash or freeze, even under long-term high load, such as days-long 4-core AVX + GPU number-crunching with BOINC) and kept functioning after the freeze, even though the GPU locked itself at 539MHz and refused to budge from there (a safety mode coded in the driver?).
The Second Life viewer itself is not even a heavy load for the GTX 970 (fans at 50% or so, GPU temp around 60°C).

The cosmic ray explanation is plausible, I suppose, even though the evil act of a Gremlin would not be a less plausible cause… :-D

In any case, I updated to the v375.10 beta driver today… Let’s see how it will fare (currently running a burn-in test with Unigine Heaven).

The “%{” warning actually came from the gcc version string: a “%{vendor}” macro string coming from the RPM .spec file made its way into the version string by mistake (“%{_vendor}” is the valid macro to use for this distribution). I fixed that on my system, but it’s totally unrelated to any driver bug, and definitely not a memory corruption issue.
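For anyone hitting the same warning, a quick check for leftover RPM-style macros in a version string could look like the sketch below. The call trace above (version_proc_show, seq_printf, format_decode) is consistent with the version text being run through the kernel's printf-style formatter, which is why a literal “%{” gets flagged. The function names here are made up for illustration.

```python
import re

# An unexpanded RPM macro looks like "%{name}".
MACRO_RE = re.compile(r"%\{[^}]*\}")


def unexpanded_macros(text):
    """Return every '%{...}' sequence found in text."""
    return MACRO_RE.findall(text)


def check_proc_version(path="/proc/version"):
    """Scan the kernel version banner for leftover macros."""
    with open(path, errors="replace") as handle:
        return unexpanded_macros(handle.read())
```

An empty list from `check_proc_version()` would mean the banner contains no stray macro; the same `unexpanded_macros()` helper can be applied to the output of `gcc --version`.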