RTX 4080S: Frequent NVIDIA Driver Hangs with Kernel Call Trace

I am experiencing a recurring issue with my RTX 4080S GPU on a workstation running Ubuntu 22.04.4 LTS. Below are the details of the problem:
System Configuration:

GPU: RTX 4080S
OS: Ubuntu 22.04.4 LTS
Kernel Version: 5.15.0-78-generic
Environment: Running PyTorch projects in Docker
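
For context, the containers are launched roughly as follows; the image tag, mount path, and script name below are placeholders rather than the exact values I use:

  # representative launch command (image tag and paths are placeholders)
  docker run --gpus all --rm -it \
      -v /data/project:/workspace \
      pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime \
      python /workspace/train.py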

Problem Description:

After running PyTorch projects in Docker for 1–2 days, the GPU becomes unresponsive and the nvidia-smi command hangs indefinitely. Checking dmesg, I found the following call traces:

[Tue Nov 19 18:15:34 2024] docker0: port 1(veth22b4df5) entered disabled state
[Tue Nov 19 18:15:34 2024] veth15dee9a: renamed from eth0
[Tue Nov 19 18:15:34 2024] docker0: port 1(veth22b4df5) entered disabled state
[Tue Nov 19 18:15:34 2024] device veth22b4df5 left promiscuous mode
[Tue Nov 19 18:15:34 2024] docker0: port 1(veth22b4df5) entered disabled state
[Tue Nov 19 18:25:28 2024] INFO: task pt_main_thread:594193 blocked for more than 120 seconds.
[Tue Nov 19 18:25:28 2024]       Tainted: P           OE     5.15.0-78-generic #85-Ubuntu
[Tue Nov 19 18:25:28 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 19 18:25:28 2024] task:pt_main_thread  state:D stack:    0 pid:594193 ppid:416738 flags:0x00004226
[Tue Nov 19 18:25:28 2024] Call Trace:
[Tue Nov 19 18:25:28 2024]  <TASK>
[Tue Nov 19 18:25:28 2024]  __schedule+0x24e/0x590
[Tue Nov 19 18:25:28 2024]  schedule+0x69/0x110
[Tue Nov 19 18:25:28 2024]  rwsem_down_write_slowpath+0x230/0x3e0
[Tue Nov 19 18:25:28 2024]  ? try_to_wake_up+0x200/0x5a0
[Tue Nov 19 18:25:28 2024]  ? native_queued_spin_lock_slowpath+0x2c/0x40
[Tue Nov 19 18:25:28 2024]  down_write+0x47/0x60
[Tue Nov 19 18:25:28 2024]  os_acquire_rwlock_write+0x35/0x50 [nvidia]
[Tue Nov 19 18:25:28 2024]  _nv044172rm+0x10/0x40 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv045433rm+0x26c/0x2f0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv012652rm+0x82/0x1e0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv039216rm+0x67/0x250 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv039216rm+0x44/0x250 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? rm_gpu_ops_stop_channel+0x23/0x60 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? nvUvmGetSafeStack+0x93/0xc0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? nvUvmInterfaceStopChannel+0x29/0x80 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? uvm_user_channel_stop+0x48/0x60 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024]  ? uvm_va_space_stop_all_user_channels.part.0+0x98/0x130 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024]  ? uvm_va_space_destroy+0xc5/0x6d0 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024]  ? uvm_release.constprop.0+0xa3/0x130 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024]  ? uvm_release_entry.part.0.isra.0+0x80/0xb0 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024]  ? security_file_free+0x54/0x60
[Tue Nov 19 18:25:28 2024]  ? kmem_cache_free+0x272/0x290
[Tue Nov 19 18:25:28 2024]  ? __call_rcu+0xa8/0x270
[Tue Nov 19 18:25:28 2024]  ? uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024]  ? __fput+0x9c/0x280
[Tue Nov 19 18:25:28 2024]  ? ____fput+0xe/0x20
[Tue Nov 19 18:25:28 2024]  ? task_work_run+0x6a/0xb0
[Tue Nov 19 18:25:28 2024]  ? do_exit+0x217/0x3c0
[Tue Nov 19 18:25:28 2024]  ? do_group_exit+0x3b/0xb0
[Tue Nov 19 18:25:28 2024]  ? get_signal+0x150/0x900
[Tue Nov 19 18:25:28 2024]  ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? arch_do_signal_or_restart+0xde/0x100
[Tue Nov 19 18:25:28 2024]  ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? exit_to_user_mode_loop+0xc4/0x160
[Tue Nov 19 18:25:28 2024]  ? exit_to_user_mode_prepare+0xa0/0xb0
[Tue Nov 19 18:25:28 2024]  ? syscall_exit_to_user_mode+0x27/0x50
[Tue Nov 19 18:25:28 2024]  ? do_syscall_64+0x69/0xc0
[Tue Nov 19 18:25:28 2024]  ? sysvec_apic_timer_interrupt+0x4e/0x90
[Tue Nov 19 18:25:28 2024]  ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Nov 19 18:25:28 2024]  </TASK>
[Tue Nov 19 18:25:28 2024] INFO: task nvidia-smi:596278 blocked for more than 120 seconds.
[Tue Nov 19 18:25:28 2024]       Tainted: P           OE     5.15.0-78-generic #85-Ubuntu
[Tue Nov 19 18:25:28 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 19 18:25:28 2024] task:nvidia-smi      state:D stack:    0 pid:596278 ppid:416738 flags:0x00000226
[Tue Nov 19 18:25:28 2024] Call Trace:
[Tue Nov 19 18:25:28 2024]  <TASK>
[Tue Nov 19 18:25:28 2024]  __schedule+0x24e/0x590
[Tue Nov 19 18:25:28 2024]  schedule+0x69/0x110
[Tue Nov 19 18:25:28 2024]  rwsem_down_write_slowpath+0x230/0x3e0
[Tue Nov 19 18:25:28 2024]  down_write+0x47/0x60
[Tue Nov 19 18:25:28 2024]  os_acquire_rwlock_write+0x35/0x50 [nvidia]
[Tue Nov 19 18:25:28 2024]  _nv044172rm+0x10/0x40 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv045433rm+0x26c/0x2f0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv047281rm+0x54/0xd0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv047220rm+0x91/0x410 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv014159rm+0x3f1/0x690 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv045410rm+0x69/0xd0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv012638rm+0x86/0xa0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? _nv000731rm+0xab2/0xeb0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? rm_ioctl+0x58/0xb0 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? nvidia_unlocked_ioctl+0x628/0x930 [nvidia]
[Tue Nov 19 18:25:28 2024]  ? __x64_sys_ioctl+0x92/0xd0
[Tue Nov 19 18:25:28 2024]  ? do_syscall_64+0x59/0xc0
[Tue Nov 19 18:25:28 2024]  ? exit_to_user_mode_prepare+0x37/0xb0
[Tue Nov 19 18:25:28 2024]  ? syscall_exit_to_user_mode+0x27/0x50
[Tue Nov 19 18:25:28 2024]  ? __do_sys_getpid+0x1e/0x30
[Tue Nov 19 18:25:28 2024]  ? do_syscall_64+0x69/0xc0
[Tue Nov 19 18:25:28 2024]  ? do_user_addr_fault+0x1e7/0x670
[Tue Nov 19 18:25:28 2024]  ? __do_sys_getpid+0x1e/0x30
[Tue Nov 19 18:25:28 2024]  ? exit_to_user_mode_prepare+0x37/0xb0
[Tue Nov 19 18:25:28 2024]  ? irqentry_exit_to_user_mode+0x9/0x20
[Tue Nov 19 18:25:28 2024]  ? irqentry_exit+0x1d/0x30
[Tue Nov 19 18:25:28 2024]  ? exc_page_fault+0x89/0x170
[Tue Nov 19 18:25:28 2024]  ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Nov 19 18:25:28 2024]  </TASK>
[Tue Nov 19 18:27:29 2024] INFO: task nv_queue:28235 blocked for more than 120 seconds.
...
[Tue Nov 19 18:27:29 2024]       Tainted: P           OE     5.15.0-78-generic #85-Ubuntu
[Tue Nov 19 18:27:29 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 19 18:27:29 2024] task:pt_main_thread  state:D stack:    0 pid:594193 ppid:416738 flags:0x00004226
[Tue Nov 19 18:27:29 2024] Call Trace:
[Tue Nov 19 18:27:29 2024]  <TASK>
[Tue Nov 19 18:27:29 2024]  __schedule+0x24e/0x590
[Tue Nov 19 18:27:29 2024]  schedule+0x69/0x110
[Tue Nov 19 18:27:29 2024]  rwsem_down_write_slowpath+0x230/0x3e0
[Tue Nov 19 18:27:29 2024]  ? try_to_wake_up+0x200/0x5a0
[Tue Nov 19 18:27:29 2024]  ? native_queued_spin_lock_slowpath+0x2c/0x40
[Tue Nov 19 18:27:29 2024]  down_write+0x47/0x60
[Tue Nov 19 18:27:29 2024]  os_acquire_rwlock_write+0x35/0x50 [nvidia]
[Tue Nov 19 18:27:29 2024]  _nv044172rm+0x10/0x40 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? _nv045433rm+0x26c/0x2f0 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? _nv012652rm+0x82/0x1e0 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? _nv039216rm+0x67/0x250 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? _nv039216rm+0x44/0x250 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? rm_gpu_ops_stop_channel+0x23/0x60 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? nvUvmGetSafeStack+0x93/0xc0 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? nvUvmInterfaceStopChannel+0x29/0x80 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? uvm_user_channel_stop+0x48/0x60 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024]  ? uvm_va_space_stop_all_user_channels.part.0+0x98/0x130 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024]  ? uvm_va_space_destroy+0xc5/0x6d0 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024]  ? uvm_release.constprop.0+0xa3/0x130 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024]  ? uvm_release_entry.part.0.isra.0+0x80/0xb0 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024]  ? security_file_free+0x54/0x60
[Tue Nov 19 18:27:29 2024]  ? kmem_cache_free+0x272/0x290
[Tue Nov 19 18:27:29 2024]  ? __call_rcu+0xa8/0x270
[Tue Nov 19 18:27:29 2024]  ? uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024]  ? __fput+0x9c/0x280
[Tue Nov 19 18:27:29 2024]  ? ____fput+0xe/0x20
[Tue Nov 19 18:27:29 2024]  ? task_work_run+0x6a/0xb0
[Tue Nov 19 18:27:29 2024]  ? do_exit+0x217/0x3c0
[Tue Nov 19 18:27:29 2024]  ? do_group_exit+0x3b/0xb0
[Tue Nov 19 18:27:29 2024]  ? get_signal+0x150/0x900
[Tue Nov 19 18:27:29 2024]  ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? arch_do_signal_or_restart+0xde/0x100
[Tue Nov 19 18:27:29 2024]  ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:27:29 2024]  ? exit_to_user_mode_loop+0xc4/0x160
[Tue Nov 19 18:27:29 2024]  ? exit_to_user_mode_prepare+0xa0/0xb0
[Tue Nov 19 18:27:29 2024]  ? syscall_exit_to_user_mode+0x27/0x50
[Tue Nov 19 18:27:29 2024]  ? do_syscall_64+0x69/0xc0
[Tue Nov 19 18:27:29 2024]  ? sysvec_apic_timer_interrupt+0x4e/0x90
[Tue Nov 19 18:27:29 2024]  ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Nov 19 18:27:29 2024]  </TASK>
[Tue Nov 19 18:27:29 2024] INFO: task pt_main_thread:594194 blocked for more than 120 seconds.

No NVRM: Xid errors appear in dmesg, and lspci reports no abnormalities in the GPU's PCIe status. However, the only way to recover the NVIDIA kernel driver is to reboot the workstation.
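
For reference, these are roughly the checks behind that statement (the grep patterns and the way I find the PCI bus ID are approximate):

  # no NVRM: Xid errors anywhere in the kernel log
  sudo dmesg -T | grep -i 'NVRM: Xid'

  # PCIe link status and AER counters for the GPU look normal
  GPU_BUS=$(lspci | grep -iE 'vga|3d' | grep -i nvidia | cut -d' ' -f1)
  sudo lspci -vvv -s "$GPU_BUS" | grep -iE 'LnkSta|UESta|CESta'

  # the blocked tasks from the traces above sit in uninterruptible sleep (state D)
  ps -o pid,stat,wchan:32,cmd -p 594193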

Troubleshooting Steps Taken:

  1. Driver and Kernel Updates: I have tried several NVIDIA driver and kernel version combinations (recorded after each reboot, as sketched after this list), but the issue persists.
  2. GPU Swap Test: I tested a colleague’s RTX 4080S from a different AIC vendor, and the issue did not occur, even after extended usage.
  3. RMA Process:
  • I sent my RTX 4080S for RMA. The vendor reported no issues, and all diagnostics passed successfully.
  • After reinstalling the GPU, the same issue reappeared within a day.
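
As mentioned in step 1, this is roughly how I record which driver and kernel combination is in use after each reboot (output omitted):

  uname -r                          # running kernel
  cat /proc/driver/nvidia/version   # NVRM (kernel module) version
  dkms status | grep -i nvidia      # DKMS build state of the nvidia modules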

Key Questions:

  1. Could this call trace indicate a hardware defect, despite the vendor’s diagnostics reporting no issues?
  2. Is it possible that the issue lies with the GPU’s VBIOS?
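
For reference, the current VBIOS version can be read while the GPU is still responsive; I am happy to post the exact value if it helps:

  nvidia-smi -q | grep -i 'vbios'
  # equivalently:
  nvidia-smi --query-gpu=vbios_version --format=csv,noheader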

This problem has been ongoing for over two months, and I would greatly appreciate your expertise in diagnosing and resolving this issue. Please let me know if additional logs or information are needed.