X sometimes hangs at 100% CPU with GTX 960 when switching resolution

Sometimes when I switch resolution, for example when starting an old game that only runs at a low resolution, X hangs and starts taking up 100% CPU. I have two monitors, a Dell U2412M (primary, DVI) and a Dell P2415Q (secondary, DP). When switching to e.g. 800x600 the secondary turns off as it should, but the primary displays an 800x600 black square at the upper left corner while the rest of the screen remains as it was before trying to switch.

Trying to switch resolution after the computer has been on for a day or so almost always triggers this issue, but it can sometimes happen earlier also. Right after rebooting the machine I can usually switch resolution as much as I want without any issues.

I can SSH into the machine, but X can’t be killed and trying to reboot it just hangs the machine.

The system log says the following:

jul 11 16:19:51 awori kernel: INFO: task kworker/10:1:116 blocked for more than 120 seconds.
jul 11 16:19:51 awori kernel:       Tainted: P           O    4.6.3-1-ARCH #1
jul 11 16:19:51 awori kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jul 11 16:19:51 awori kernel: kworker/10:1    D ffff8800b602bcb8     0   116      2 0x00000000
jul 11 16:19:51 awori kernel: Workqueue: events nvkms_workqueue_callback [nvidia_modeset]
jul 11 16:19:51 awori kernel:  ffff8800b602bcb8 00ffffff810c41d8 ffff88081b81b900 ffff8800b6020000
jul 11 16:19:51 awori kernel:  ffff8800b602c000 ffffffffa08d87c0 ffff8800b6020000 0000000000000000
jul 11 16:19:51 awori kernel:  ffff8807b5f57600 ffff8800b602bcd0 ffffffff815c372c 7fffffffffffffff
jul 11 16:19:51 awori kernel: Call Trace:
jul 11 16:19:51 awori kernel:  [<ffffffff815c372c>] schedule+0x3c/0x90
jul 11 16:19:51 awori kernel:  [<ffffffff815c6203>] schedule_timeout+0x1d3/0x260
jul 11 16:19:51 awori kernel:  [<ffffffff812f99c8>] ? find_next_bit+0x18/0x20
jul 11 16:19:51 awori kernel:  [<ffffffff81190739>] ? next_zone+0x29/0x30
jul 11 16:19:51 awori kernel:  [<ffffffff815c51a6>] __down+0x76/0xc0
jul 11 16:19:51 awori kernel:  [<ffffffff810e779e>] ? try_to_del_timer_sync+0x5e/0x90
jul 11 16:19:51 awori kernel:  [<ffffffff810c4611>] down+0x41/0x50
jul 11 16:19:51 awori kernel:  [<ffffffffa08327de>] nvkms_workqueue_callback+0x6e/0xf0 [nvidia_modeset]
jul 11 16:19:51 awori kernel:  [<ffffffff81093a05>] process_one_work+0x1e5/0x480
jul 11 16:19:51 awori kernel:  [<ffffffff81093ce8>] worker_thread+0x48/0x4e0
jul 11 16:19:51 awori kernel:  [<ffffffff81093ca0>] ? process_one_work+0x480/0x480
jul 11 16:19:51 awori kernel:  [<ffffffff81093ca0>] ? process_one_work+0x480/0x480
jul 11 16:19:51 awori kernel:  [<ffffffff81099998>] kthread+0xd8/0xf0
jul 11 16:19:51 awori kernel:  [<ffffffff815c73c2>] ret_from_fork+0x22/0x40
jul 11 16:19:51 awori kernel:  [<ffffffff810998c0>] ? kthread_worker_fn+0x170/0x170
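
For anyone comparing traces like the one above, here is a minimal Python sketch that pulls the blocked task and its stack frames out of hung-task output. It is not part of the original report: the names `parse_hung_tasks` and `modules_in` and both regular expressions are my own assumptions, written only against the log format shown in this thread.

```python
import re

# Header printed by the kernel's hung-task detector, e.g.
# "INFO: task kworker/10:1:116 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

# A stack frame such as "down+0x41/0x50" or
# "? nvkms_workqueue_callback+0x6e/0xf0 [nvidia_modeset]".
# A leading "? " marks a frame the kernel itself considers unreliable.
FRAME_RE = re.compile(
    r"(?P<guess>\? )?(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def parse_hung_tasks(lines):
    """Group stack frames under the hung-task header that precedes them."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "secs": int(header["secs"]),
                "frames": [],  # (symbol, module or None, reliable)
            }
            reports.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first header
        frame = FRAME_RE.search(line)
        if frame:
            current["frames"].append(
                (frame["sym"], frame["mod"], frame["guess"] is None)
            )
    return reports


def modules_in(report):
    """Return the set of kernel modules named in a report's frames."""
    return {mod for _, mod, _ in report["frames"] if mod is not None}
```

Passing it the lines of a saved kernel log should give one dictionary per hung-task report, which makes it easier to see whether the same module shows up in every hang.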

Some machine specs:
Core i7 5820K @ stock
Asus X99-A
32GB DDR4 2400MHz
Asus GTX 960 Strix 2GB

The machine is running Arch Linux, with Linux 4.6.3 and NVIDIA driver 367.27. I’ve tried turning KMS on as per the instructions on the Arch Linux wiki, but it made no difference so I turned it off again (the log above was captured with KMS turned off). Let me know if you need any more information.

Edit: Actually, I realised I was wrong when describing the switching. The secondary monitor shouldn’t turn off; what it normally does is just flash black and then return to its previous state (I only game on the primary monitor and thus turn off the secondary by reflex, which is why I got a bit confused). When this issue occurs that does not happen, and the secondary monitor just remains black.
nvidia-bug-report.log.gz (189 KB)

We have a similar problem with Gigabyte GTX 960 cards and drivers 375.26 and 367.57.

We use these cards with ffmpeg for H264 decoding -> H264 encoding, with CUDA resize and deinterlace. They sometimes hang: some stations hang every second day, some hang once per week, and the longest uptime is 14 days.

We tested GTX 1070 and GTX 1080 cards and they work on exactly the same system, with the same drivers and applications, without any problems. So I think it is a memory leak in the graphics driver or an instability in the CUDA cores.

This happens on all 25 of our Gigabyte GTX 960 cards (we don’t have cards from any other vendor), whether in AMD/Intel Supermicro servers or in ordinary desktop computers!

Device Id : 0x140110DE
Bus Id : 0000:04:00.0
Sub System Id : 0x36C11458

Jan  3 20:17:31 shi-node4 kernel: [938938.049862] INFO: task ffmpeg-3.2-2016:2003909 blocked for more than 120 seconds.
Jan  3 20:17:31 shi-node4 kernel: [938938.049883]       Tainted: P           O  3.16.0-4-amd64 #1
Jan  3 20:17:31 shi-node4 kernel: [938938.049908] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  3 20:17:31 shi-node4 kernel: [938938.049932] ffmpeg-3.2-2016 D ffff8816c8a6c6a8     0 2003909      1 0x00000006
Jan  3 20:17:31 shi-node4 kernel: [938938.049941]  ffff8816c8a6c250 0000000000000046 0000000000012f40 ffff8815733f3fd8
Jan  3 20:17:31 shi-node4 kernel: [938938.049946]  0000000000012f40 ffff8816c8a6c250 ffff8808332368c8 ffff8815733f3ae8
Jan  3 20:17:31 shi-node4 kernel: [938938.049952]  7fffffffffffffff 0000000000000002 0000000000000000 ffff8816c8a6c250
Jan  3 20:17:31 shi-node4 kernel: [938938.049957] Call Trace:
Jan  3 20:17:31 shi-node4 kernel: [938938.049973]  [<ffffffff81514289>] ? schedule_timeout+0x259/0x2d0
Jan  3 20:17:31 shi-node4 kernel: [938938.049981]  [<ffffffff8108d7a6>] ? atomic_notifier_call_chain+0x16/0x20
Jan  3 20:17:31 shi-node4 kernel: [938938.049987]  [<ffffffff81095e43>] ? set_task_cpu+0xf3/0x1c0
Jan  3 20:17:31 shi-node4 kernel: [938938.049995]  [<ffffffff812acf20>] ? cpumask_next_and+0x30/0x40
Jan  3 20:17:31 shi-node4 kernel: [938938.050001]  [<ffffffff8109d709>] ? select_task_rq_fair+0x6e9/0x700
Jan  3 20:17:31 shi-node4 kernel: [938938.050008]  [<ffffffff81516dea>] ? __down_common+0xa0/0xf3
Jan  3 20:17:31 shi-node4 kernel: [938938.050312]  [<ffffffffa1707209>] ? os_get_current_time+0x19/0x30 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.050320]  [<ffffffff810adbbb>] ? down+0x3b/0x50
Jan  3 20:17:31 shi-node4 kernel: [938938.050488]  [<ffffffffa1706da3>] ? os_acquire_mutex+0x43/0x50 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.050695]  [<ffffffffa1ca0f28>] ? _nv017461rm+0x18/0x30 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.050915]  [<ffffffffa1c3adbd>] ? _nv019673rm+0x3d/0x120 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.051107]  [<ffffffffa1ca7e76>] ? rm_free_unused_clients+0x56/0xf0 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.051251]  [<ffffffffa16fe212>] ? nvidia_close+0x202/0x320 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.051409]  [<ffffffffa16fb3d7>] ? nvidia_frontend_close+0x27/0x50 [nvidia]
Jan  3 20:17:31 shi-node4 kernel: [938938.051427]  [<ffffffff811ac1aa>] ? __fput+0xca/0x1d0
Jan  3 20:17:31 shi-node4 kernel: [938938.051438]  [<ffffffff8108670c>] ? task_work_run+0x8c/0xb0
Jan  3 20:17:31 shi-node4 kernel: [938938.051445]  [<ffffffff8106ac61>] ? do_exit+0x2b1/0xa70
Jan  3 20:17:31 shi-node4 kernel: [938938.051451]  [<ffffffff8106b499>] ? do_group_exit+0x39/0xa0
Jan  3 20:17:31 shi-node4 kernel: [938938.051457]  [<ffffffff81079958>] ? get_signal_to_deliver+0x1c8/0x5d0
Jan  3 20:17:31 shi-node4 kernel: [938938.051475]  [<ffffffff81013492>] ? do_signal+0x42/0xa10
Jan  3 20:17:31 shi-node4 kernel: [938938.051528]  [<ffffffff81013ed8>] ? do_notify_resume+0x78/0xa0
Jan  3 20:17:31 shi-node4 kernel: [938938.051535]  [<ffffffff8151884a>] ? int_signal+0x12/0x17

After this crash I am unable to rmmod the driver or reset it through nvidia-smi -r; the only working solution is a reboot :(
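
The task dump above shows ffmpeg in state D (uninterruptible sleep). A task in that state does not act on signals until it wakes up, which would fit with the module staying in use. As a rough sketch only, and assuming a Linux /proc layout, something like the following could list the processes currently stuck that way. The function names are hypothetical, and it only looks at each process's main thread, not at every thread.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the LAST ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for processes whose main thread is in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the file was not in the expected form
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)
```

Running `uninterruptible_tasks()` over SSH after a hang and seeing the same process names every time would at least confirm which tasks are waiting on the driver.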

@perost

“Optimal Resolution:
1920 x 1200 at 60 Hz”

Dell 24 Monitor | U2412M | Dell
http://www.dell.com/us/business/p/dell-u2412m/pd

“Native Resolution
4K 3840 x 2160 (DisplayPort: 60 Hz,”

Dell P2415Q | Dell
http://www.dell.com/en-us/shop/accessories/apd/210-agnk?c=us&l=en&s=dhs&cs=19&sku=210-AGNK

Asus GTX 960 Strix 2GB

“…Trying to switch resolution after the computer has been on for a day or so almost always triggers this issue, but it can sometimes happen earlier also…”

Check your '960’s memory usage prior to doing the above resolution switch. I suspect that the card’s 2GB frame buffer is insufficient at times for how you are currently using your two monitors. You may be better off choosing a lower resolution for your P2415Q and leaving it there as long as you’re going to use both Dells concurrently.
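
One way to do that check, sketched in Python under the assumption that `nvidia-smi` is on the PATH: the script below asks for used and total memory in MiB through the `--query-gpu` option and parses the CSV answer. The function names are made up for illustration, and the parser expects plain integers, so it will raise an error if the driver reports a field as unavailable.

```python
import subprocess

# nvidia-smi query for used and total framebuffer memory, in MiB,
# printed as bare numbers: one "used, total" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn lines like '512, 2048' into a list of (used_mib, total_mib)."""
    result = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        result.append((used, total))
    return result


def gpu_memory():
    """Run nvidia-smi and return (used_mib, total_mib) for each GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)


def free_mib(used_total):
    """Memory left on one GPU, given a (used_mib, total_mib) pair."""
    used, total = used_total
    return total - used
```

Calling `gpu_memory()` right before the resolution switch, and again after a fresh login, gives two numbers to compare against the card's 2GB.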

In a bid to find out the max. resolution my STRIX-GTX960-DC2OC-4GD5 can reasonably drive, I’ve used Unigine’s Valley 1.0 and Heaven 4.0 (minimum settings) at a windowed resolution of 3840 x 2160 while monitoring my '960’s memory usage. Said usage bumped up to and at times slightly exceeded 2GB.

The conclusion I’ve drawn from my various Unigine experiments is that a 2GB GTX 960 can be expected to reliably drive a 2560 x 1440 resolution for non-graphics-intensive uses at a conservative performance level under all conditions. The STRIX-GTX960-DC2OC-4GD5 suffers the same performance limitation due to its GPU and its 1024 CUDA cores, but has a frame buffer large enough to fully drive two monitors for fairly static display functions.

Also, CPU bottlenecking is not a factor in the performance portion of my findings:

Maxwell Bottlenecking Chart - Google Sheets
https://docs.google.com/spreadsheets/d/14LcYGkqVqaHUK_qwS_6N9s1vp-TH7AyHzn7OrNittag/edit#gid=0

(Source)
Sep 21, 2016
Ultimate Bottlenecking Guide! - Graphics Cards - Tom’s Hardware
http://www.tomshardware.com/forum/id-3192807/ultimate-bottlenecking-guide.html

@JGB123321
My issue occurs even on the desktop with nothing else open, just by e.g. opening nvidia-settings and changing the resolution.

I did check how much memory was being used, and nvidia-smi reported about 740MiB of which Xorg was using 630MiB (the rest being used by cinnamon and compton). I then tried to log out to restart Xorg and see how much would be used when Xorg had just started, but logging out and going back to the login screen caused the issue to appear again and the driver crashed as usual. So it’s not a matter of running out of memory, since I had ~1.3GiB available when the driver crashed this time.

And for the record, Xorg is using ~500MiB when I’ve just logged in. That seems a bit excessive, but it could be perfectly normal as far as I know.

Also, I’m now using the NVIDIA driver 375.26 and Linux 4.8.13, and still have the exact same issue.

Could you test with a Pascal card?

@Thunderm
No, the GTX 960 is the only NVIDIA card I have at the moment.

I’ve now upgraded to a GTX 1070, and replaced my U2412M monitor with an LG 27UD88-W (along with the P2415Q, both connected via DP). With the 378.13 driver and Linux 4.10.6 the issue remains exactly the same.

Still happens with 415.18 and Linux 4.19.4, hardware same as before. I played some Witcher 3 yesterday (edit: using Wine/DXVK), using 2560x1440 resolution instead of my monitors’ native 3840x2160, and it was working just fine. When trying to start the game again today without having rebooted the computer, both monitors immediately died and Xorg got stuck at 100% CPU as usual. The call trace seems to have changed a bit over the years though; it now looks like this:

dec 05 13:55:19 awori kernel: INFO: task nvidia-modeset:490 blocked for more than 120 seconds.
dec 05 13:55:19 awori kernel:       Tainted: P           OE     4.19.4-arch1-1-ARCH #1
dec 05 13:55:19 awori kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dec 05 13:55:19 awori kernel: nvidia-modeset  D    0   490      2 0x80000080
dec 05 13:55:19 awori kernel: Call Trace:
dec 05 13:55:19 awori kernel:  ? __schedule+0x29b/0x8b0
dec 05 13:55:19 awori kernel:  ? _raw_q_schedule+0x70/0x70 [nvidia]
dec 05 13:55:19 awori kernel:  schedule+0x32/0x90
dec 05 13:55:19 awori kernel:  schedule_timeout+0x311/0x4a0
dec 05 13:55:19 awori kernel:  ? schedule_timeout+0x311/0x4a0
dec 05 13:55:19 awori kernel:  ? _raw_q_schedule+0x70/0x70 [nvidia]
dec 05 13:55:19 awori kernel:  __down+0x7d/0xd0
dec 05 13:55:19 awori kernel:  down+0x3b/0x50
dec 05 13:55:19 awori kernel:  nvkms_kthread_q_callback+0x61/0xc0 [nvidia_modeset]
dec 05 13:55:19 awori kernel:  _main_loop+0x6f/0x130 [nvidia]
dec 05 13:55:19 awori kernel:  kthread+0x112/0x130
dec 05 13:55:19 awori kernel:  ? kthread_park+0x80/0x80
dec 05 13:55:19 awori kernel:  ret_from_fork+0x35/0x40

Is there a suspend/resume cycle involved?

No, I don’t use suspend/resume, I usually keep the computer on at all times.

It seems like playing the Witcher 3 using DXVK is a very reliable way to recreate the bug. I’ve been playing it for a couple of days now, and trying to start the game again the next day (without rebooting or suspending) has caused the NVIDIA driver to hang 100% of the time.