Several GL applications hang with "NVRM: Xid (0000:01:00): 13" [331.49/GTX780]

Hi.

I’m getting hangs with these “Xid” messages on my new PC, which has a Gigabyte GTX 780. A single application will have its graphics hang, and one of the messages below will appear in the kernel log. It is usually a game, though I suspect this has also hit KWin or X11 at some point, as I’ve had a few system-wide hangs. Sometimes a new frame (or part of one) will still be rendered, but often with graphical “glitches”: only part of the display, usually a few contiguous rectangular regions, will update, and typically the very edge of the screen will not.

While I’ve definitely had this issue in the recent port of Portal 2, it appears very quickly (within a minute or two of starting) and reproducibly in both “Crusader Kings 2” and Double Fine’s “Steed” prototype from the 2014 “Amnesia Fortnight”, running under Wine 1.7.13. Most other OpenGL programs (including many games, albeit usually less taxing ones) run for a considerable time with no such problems.

The messages which appear are of the form:

NVRM: GPU at 0000:01:00: GPU-51e8de7f-e984-ff21-7f5e-e9ef3d2d36fc
NVRM: Xid (0000:01:00): 13, 0008 00000000 0000a197 000017d8 00000203 0000000c
NVRM: Xid (0000:01:00): 32, Channel ID 0000000c intr 00040000
NVRM: Xid (0000:01:00): 13, 000c 00000000 0000a197 000017d8 00010001 0000000c
NVRM: Xid (0000:01:00): 13, 000c 00000000 0000a197 000017d8 00000011 0000000c
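In case it helps anyone comparing logs, here is a minimal Python sketch for pulling these lines out of a kernel log and counting them per Xid number. The regular expression is an assumption based only on the sample lines above, and the names `parse_xid` and `count_xids` are my own, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed line shape, taken only from the sample messages in this post:
#   NVRM: Xid (<pci address>): <number>, <rest of message>
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def parse_xid(line):
    """Return (pci_address, xid_number, rest), or None for non-Xid lines."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3).strip()


def count_xids(lines):
    """Count how often each Xid number occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        parsed = parse_xid(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


# The lines quoted above, used here as sample input.
SAMPLE_LINES = [
    "NVRM: GPU at 0000:01:00: GPU-51e8de7f-e984-ff21-7f5e-e9ef3d2d36fc",
    "NVRM: Xid (0000:01:00): 13, 0008 00000000 0000a197 000017d8 00000203 0000000c",
    "NVRM: Xid (0000:01:00): 32, Channel ID 0000000c intr 00040000",
    "NVRM: Xid (0000:01:00): 13, 000c 00000000 0000a197 000017d8 00010001 0000000c",
    "NVRM: Xid (0000:01:00): 13, 000c 00000000 0000a197 000017d8 00000011 0000000c",
]

for xid, count in sorted(count_xids(SAMPLE_LINES).items()):
    print(f"Xid {xid}: {count}")
```

The same functions should work on any iterable of lines, for example the saved output of `dmesg` or `journalctl -k`, as long as the messages keep the shape shown above.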

The nvidia kernel module also complains about the lack of a VGA console, and VT switching does not work. The system boots from EFI, and I’ve been unable to find a way of getting a VGA-compatible console out of it.

At one point, during the shutdown procedure, I got a:

BUG: soft lockup - CPU#0 stuck for 22s! [X:1041]
Modules linked in: ip6t_rpfilter bnep ip6t_REJECT bluetooth cfg80211 xt_conntrack rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_c
 binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper firewire_ohci mxm_wmi drm firewire_core crc_itu_t i2c_core wmi video
Mar 08 21:01:01 sparky kernel: CPU: 0 PID: 1041 Comm: X Tainted: PF         IO 3.13.5-202.fc20.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD5H/Z87X-UD5H-CF, BIOS F8 01/17/2014
task: ffff8808053488a0 ti: ffff8807fa462000 task.ti: ffff8807fa462000
RIP: 0010:[<ffffffffa0a7027e>]  [<ffffffffa0a7027e>] os_io_write_dword+0xe/0x10 [nvidia]
RSP: 0018:ffff8807fa463c20  EFLAGS: 00000286
RAX: 0000000000009400 RBX: 0000000000000001 RCX: ffffffffa0e91bb0
RDX: 000000000000e008 RSI: 0000000000009400 RDI: 000000000000e008
RBP: ffff8807fa463c20 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
R13: ffff8807f83b2f28 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f79642389c0(0000) GS:ffff88083f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f79634e8850 CR3: 000000080f125000 CR4: 00000000001407f0
Stack:
 ffff8807f83b2f28 ffffffffa0a5019e ffffffffa0e96c84 ffffffffa0a5c633
 ffff8807f83b2f80 ffffffffa0a53216 ffff8807fa454008 ffffffffa0a50726
 ffff8807fa454008 ffff8807f83b2f84 0000000000004f02 0000000000000001
Call Trace:
 [<ffffffffa0a5019e>] rm_shutdown_gvi_device+0x106/0x290 [nvidia]
 [<ffffffffa0a5c633>] ? _nv017101rm+0x8746/0xcee3 [nvidia]
 [<ffffffffa0a53216>] ? _nv000956rm+0x83/0xa4 [nvidia]
 [<ffffffffa0a50726>] ? _nv012910rm+0x19d/0x9f0 [nvidia]
 [<ffffffffa0a3ff3a>] ? _nv013192rm+0x8c/0x16d [nvidia]
 [<ffffffffa0a444d9>] ? _nv000840rm+0x359/0x3c9 [nvidia]
 [<ffffffffa0a4446b>] ? _nv000840rm+0x2eb/0x3c9 [nvidia]
 [<ffffffffa0a449fe>] ? _nv000763rm+0x4b5/0x552 [nvidia]
 [<ffffffffa0a46e4a>] ? _nv014930rm+0x99/0xbb [nvidia]
 [<ffffffffa0a3d4ff>] ? _nv000818rm+0x44f/0x9d7 [nvidia]
 [<ffffffffa0a46d27>] ? rm_ioctl+0x76/0x100 [nvidia]
 [<ffffffffa0a70b00>] ? os_pci_read_byte+0x10/0x40 [nvidia]
 [<ffffffffa0a65577>] ? nvidia_ioctl+0x147/0x480 [nvidia]
 [<ffffffffa0a7237f>] ? nvidia_frontend_ioctl+0x2f/0x70 [nvidia]
 [<ffffffffa0a723e1>] ? nvidia_frontend_unlocked_ioctl+0x21/0x30 [nvidia]
 [<ffffffff811cb6f8>] ? do_vfs_ioctl+0x2d8/0x4a0
 [<ffffffff811ba79e>] ? ____fput+0xe/0x10
 [<ffffffff811cb941>] ? SyS_ioctl+0x81/0xa0
 [<ffffffff816919fe>] ? do_page_fault+0xe/0x10
 [<ffffffff81695f69>] ? system_call_fastpath+0x16/0x1b
Code: 00 00 55 89 f0 89 fa 48 89 e5 66 ef 5d c3 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 89 f0 89 fa 48 89 e5 ef <5d> c3 0f 1f 44 00 00 55 89 fa 48 89

The system has also had other “BUG: soft lockup” reports in which the nvidia module does not appear in the stack trace. These may not be NVIDIA-related at all: the system is new and could have other, as-yet-unidentified issues.

I’ve seen this issue on Fedora 20 (kernel 3.13.5-202.fc20.x86_64) with the 331.49 driver. The machine has had a complete, exhaustive memory test with MemTest86+, which showed no issues. The graphics card does not appear to be overheating, and is otherwise working well. The system has two 1440x900 monitors, connected over DVI.

The system also has an Intel integrated GPU (it is a desktop with an Intel Core i7 4770K CPU). I’ve disabled this in the UEFI BIOS and am booting with intel_iommu=off, but neither seems to help. I’ve also reseated the GPU, which again had no effect.

The output from nvidia-bug-report.sh is located here:
http://davidgow.net/stuff/nvidia-bug-report.log.gz

Please let me know if there is any more information I can provide, or if this is a known issue or a hardware fault: I am anxious to get this issue resolved.

— David