Kernel Task Hang on 390.48 / 396.18 (GTX 770)

Hi,

Since upgrading to either of these drivers from 384.111, I get a frequent kernel hang on my Linux 4.13 / 4.15 system with a GTX 770.

There is an Ubuntu bug report here, but I will copy the relevant info below: https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-390/+bug/1767932

4.15 + 390.48 hung task:
Apr 30 15:21:50 michael-desktop-ubuntu kernel: INFO: task nvidia-modeset:243 blocked for more than 120 seconds.
Apr 30 15:21:50 michael-desktop-ubuntu kernel: Tainted: P IOE 4.15.0-20-generic #21-Ubuntu
Apr 30 15:21:50 michael-desktop-ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 30 15:21:50 michael-desktop-ubuntu kernel: nvidia-modeset D 0 243 2 0x80000000
Apr 30 15:21:50 michael-desktop-ubuntu kernel: Call Trace:
Apr 30 15:21:50 michael-desktop-ubuntu kernel: __schedule+0x297/0x8b0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: schedule+0x2c/0x80
Apr 30 15:21:50 michael-desktop-ubuntu kernel: schedule_timeout+0x1cf/0x350
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? schedule_timeout+0x1cf/0x350
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? __slab_free+0x14d/0x2c0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: __down+0x91/0xe0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: down+0x41/0x50
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? down+0x41/0x50
Apr 30 15:21:50 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
Apr 30 15:21:50 michael-desktop-ubuntu kernel: _main_loop+0x76/0x140 [nvidia]
Apr 30 15:21:50 michael-desktop-ubuntu kernel: kthread+0x121/0x140
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? _raw_q_schedule+0x80/0x80 [nvidia]
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ret_from_fork+0x35/0x40

396.18 + 4.15 hung task:
May 04 08:44:06 michael-desktop-ubuntu kernel: INFO: task nvidia-modeset:244 blocked for more than 120 seconds.
May 04 08:44:06 michael-desktop-ubuntu kernel: Tainted: P IOE 4.15.0-20-generic #21-Ubuntu
May 04 08:44:06 michael-desktop-ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 04 08:44:06 michael-desktop-ubuntu kernel: nvidia-modeset D 0 244 2 0x80000000
May 04 08:44:06 michael-desktop-ubuntu kernel: Call Trace:
May 04 08:44:06 michael-desktop-ubuntu kernel: __schedule+0x297/0x8b0
May 04 08:44:06 michael-desktop-ubuntu kernel: schedule+0x2c/0x80
May 04 08:44:06 michael-desktop-ubuntu kernel: schedule_timeout+0x1cf/0x350
May 04 08:44:06 michael-desktop-ubuntu kernel: ? schedule_timeout+0x1cf/0x350
May 04 08:44:06 michael-desktop-ubuntu kernel: ? __slab_free+0x14d/0x2c0
May 04 08:44:06 michael-desktop-ubuntu kernel: ? ttwu_do_activate+0x7a/0x90
May 04 08:44:06 michael-desktop-ubuntu kernel: __down+0x91/0xe0
May 04 08:44:06 michael-desktop-ubuntu kernel: down+0x41/0x50
May 04 08:44:06 michael-desktop-ubuntu kernel: ? down+0x41/0x50
May 04 08:44:06 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
May 04 08:44:06 michael-desktop-ubuntu kernel: _main_loop+0x76/0x140 [nvidia]
May 04 08:44:06 michael-desktop-ubuntu kernel: kthread+0x121/0x140
May 04 08:44:06 michael-desktop-ubuntu kernel: ? _raw_q_schedule+0x80/0x80 [nvidia]
May 04 08:44:06 michael-desktop-ubuntu kernel: ? kthread_create_worker_on_cpu+0x70/0x70
May 04 08:44:06 michael-desktop-ubuntu kernel: ret_from_fork+0x35/0x40

nvidia bug report dump: https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-390/+bug/1767932/+attachment/5130530/+files/nvidia-bug-report.log.gz

In the meantime I am upgrading to 396.24, but I do not expect that to resolve it.

Same on 396.24

Same on 396.24-0ubuntu0~gpu18.04.1

May 04 22:17:05 michael-desktop-ubuntu kernel: INFO: task nvidia-modeset:245 blocked for more than 120 seconds.
May 04 22:17:05 michael-desktop-ubuntu kernel: Tainted: P IOE 4.15.0-20-generic #21-Ubuntu
May 04 22:17:05 michael-desktop-ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 04 22:17:05 michael-desktop-ubuntu kernel: nvidia-modeset D 0 245 2 0x80000000
May 04 22:17:05 michael-desktop-ubuntu kernel: Call Trace:
May 04 22:17:05 michael-desktop-ubuntu kernel: __schedule+0x297/0x8b0
May 04 22:17:05 michael-desktop-ubuntu kernel: schedule+0x2c/0x80
May 04 22:17:05 michael-desktop-ubuntu kernel: schedule_timeout+0x1cf/0x350
May 04 22:17:05 michael-desktop-ubuntu kernel: ? schedule_timeout+0x1cf/0x350
May 04 22:17:05 michael-desktop-ubuntu kernel: ? __slab_free+0x14d/0x2c0
May 04 22:17:05 michael-desktop-ubuntu kernel: ? ttwu_do_activate+0x7a/0x90
May 04 22:17:05 michael-desktop-ubuntu kernel: __down+0x91/0xe0
May 04 22:17:05 michael-desktop-ubuntu kernel: down+0x41/0x50
May 04 22:17:05 michael-desktop-ubuntu kernel: ? down+0x41/0x50
May 04 22:17:05 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
May 04 22:17:05 michael-desktop-ubuntu kernel: _main_loop+0x76/0x140 [nvidia]
May 04 22:17:05 michael-desktop-ubuntu kernel: kthread+0x121/0x140
May 04 22:17:05 michael-desktop-ubuntu kernel: ? _raw_q_schedule+0x80/0x80 [nvidia]
May 04 22:17:05 michael-desktop-ubuntu kernel: ? kthread_create_worker_on_cpu+0x70/0x70
May 04 22:17:05 michael-desktop-ubuntu kernel: ret_from_fork+0x35/0x40
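To check that the reports above are really the same hang, the call traces can be compared with the timestamps and offsets stripped. This is only a rough sketch, assuming the journal lines are pasted in as text; the function name trace_symbols and the regular expression are my own and not part of any nvidia or kernel tool.

```python
import re

# Matches one call-trace frame after the "kernel: " prefix, for example
# "__down+0x91/0xe0" or "? down+0x41/0x50".
FRAME = re.compile(r"kernel: (\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def trace_symbols(log_text, include_unreliable=False):
    """Return the function names of a kernel call trace, in order.

    Frames prefixed with "?" are addresses found on the stack that the
    unwinder could not confirm as part of the call chain. They are
    skipped unless include_unreliable is True.
    """
    symbols = []
    for line in log_text.splitlines():
        match = FRAME.search(line)
        if match is None:
            continue  # not a stack frame (header line, taint line, ...)
        if match.group(1) and not include_unreliable:
            continue
        symbols.append(match.group(2))
    return symbols


# A few lines copied from the 390.48 trace above.
sample = """\
Apr 30 15:21:50 michael-desktop-ubuntu kernel: __down+0x91/0xe0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: down+0x41/0x50
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? down+0x41/0x50
Apr 30 15:21:50 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
"""

print(trace_symbols(sample))  # ['__down', 'down', 'nvkms_kthread_q_callback']
```

Running it on each full trace and comparing the resulting lists is a quick way to see whether the reliable frames match across driver versions.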

Do as the message suggests:

#/etc/sysctl.d/hung_tasks_timeout.conf

kernel.hung_task_timeout_secs = 0

and reboot. The load on the machine may simply be higher than you would normally expect.

Do you have any other devices such as a USB disk that may be “busy” at the time of the error? Does it happen randomly or at certain events such as turning a monitor on or off?

Hi Hussam,

Disabling that message will not help - the task will still hang, just without any indication of where it is hanging.
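For anyone who wants to confirm what the watchdog is currently set to before changing it, here is a small sketch that reads the sysctl from /proc. The helper names hung_task_timeout and describe are made up for illustration. As far as I know, a value of 0 turns the check off entirely, and the 120 in the log above is the timeout that was in effect.

```python
from pathlib import Path

# The procfs file named in the kernel message above.
HUNG_TASK_SYSCTL = Path("/proc/sys/kernel/hung_task_timeout_secs")


def hung_task_timeout(path=HUNG_TASK_SYSCTL):
    """Return the hung-task timeout in seconds, or None if it cannot be read
    (for example on a kernel built without hung-task detection)."""
    try:
        return int(Path(path).read_text().strip())
    except OSError:
        return None


def describe(timeout):
    """Turn the raw value into a short human-readable summary."""
    if timeout is None:
        return "hung-task detection not available"
    if timeout == 0:
        return "hung-task warnings disabled"
    return f"warn after {timeout} seconds blocked"


if __name__ == "__main__":
    print(describe(hung_task_timeout()))
```

Setting the value to 0 only silences the report; it does not change whether nvidia-modeset blocks.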

I cannot identify anything specific that triggers the issue, but it tends to happen shortly after startup. If the machine stays running for the first 10 minutes, it generally stays up.

It’s definitely a deadlock in the nvidia driver, so I don’t think high load is the cause either.

Bump

You’re getting DMAR-related errors; please test with the kernel parameter intel_iommu=off.
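To verify that the parameter actually took effect after the reboot, and to pull the DMAR lines out of a saved log, something like the following could be used. It is a sketch only: the helper names iommu_values and dmar_lines are hypothetical, and the example command line is invented, not taken from the bug report.

```python
def iommu_values(cmdline):
    """Return every value given for intel_iommu= on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            values.append(token.split("=", 1)[1])
    return values


def dmar_lines(log_text):
    """Return the log lines that mention DMAR."""
    return [line for line in log_text.splitlines() if "DMAR" in line]


# Hypothetical command line, shown only to illustrate the parsing.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet splash intel_iommu=off"
print(iommu_values(example))  # ['off']
```

On the affected machine the real command line can be read from /proc/cmdline and passed to iommu_values; an empty list means the parameter was not applied.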

No luck, the error is exactly as before.
Definitely an nvidia driver issue.