I am experiencing a recurring issue with my RTX 4080S GPU on a workstation running Ubuntu 22.04.4 LTS. Below are the details of the problem:
System Configuration:
GPU: RTX 4080S
OS: Ubuntu 22.04.4 LTS
Kernel Version: 5.15.0-78-generic
Environment: Running PyTorch projects in Docker
Problem Description:
After running PyTorch projects in Docker for 1–2 days, the GPU becomes unresponsive and the nvidia-smi command hangs indefinitely. Upon checking dmesg, I found the following call trace:
[Tue Nov 19 18:15:34 2024] docker0: port 1(veth22b4df5) entered disabled state
[Tue Nov 19 18:15:34 2024] veth15dee9a: renamed from eth0
[Tue Nov 19 18:15:34 2024] docker0: port 1(veth22b4df5) entered disabled state
[Tue Nov 19 18:15:34 2024] device veth22b4df5 left promiscuous mode
[Tue Nov 19 18:15:34 2024] docker0: port 1(veth22b4df5) entered disabled state
[Tue Nov 19 18:25:28 2024] INFO: task pt_main_thread:594193 blocked for more than 120 seconds.
[Tue Nov 19 18:25:28 2024] Tainted: P OE 5.15.0-78-generic #85-Ubuntu
[Tue Nov 19 18:25:28 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 19 18:25:28 2024] task:pt_main_thread state:D stack: 0 pid:594193 ppid:416738 flags:0x00004226
[Tue Nov 19 18:25:28 2024] Call Trace:
[Tue Nov 19 18:25:28 2024] <TASK>
[Tue Nov 19 18:25:28 2024] __schedule+0x24e/0x590
[Tue Nov 19 18:25:28 2024] schedule+0x69/0x110
[Tue Nov 19 18:25:28 2024] rwsem_down_write_slowpath+0x230/0x3e0
[Tue Nov 19 18:25:28 2024] ? try_to_wake_up+0x200/0x5a0
[Tue Nov 19 18:25:28 2024] ? native_queued_spin_lock_slowpath+0x2c/0x40
[Tue Nov 19 18:25:28 2024] down_write+0x47/0x60
[Tue Nov 19 18:25:28 2024] os_acquire_rwlock_write+0x35/0x50 [nvidia]
[Tue Nov 19 18:25:28 2024] _nv044172rm+0x10/0x40 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv045433rm+0x26c/0x2f0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv012652rm+0x82/0x1e0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv039216rm+0x67/0x250 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv039216rm+0x44/0x250 [nvidia]
[Tue Nov 19 18:25:28 2024] ? rm_gpu_ops_stop_channel+0x23/0x60 [nvidia]
[Tue Nov 19 18:25:28 2024] ? nvUvmGetSafeStack+0x93/0xc0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? nvUvmInterfaceStopChannel+0x29/0x80 [nvidia]
[Tue Nov 19 18:25:28 2024] ? uvm_user_channel_stop+0x48/0x60 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024] ? uvm_va_space_stop_all_user_channels.part.0+0x98/0x130 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024] ? uvm_va_space_destroy+0xc5/0x6d0 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024] ? uvm_release.constprop.0+0xa3/0x130 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024] ? uvm_release_entry.part.0.isra.0+0x80/0xb0 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024] ? security_file_free+0x54/0x60
[Tue Nov 19 18:25:28 2024] ? kmem_cache_free+0x272/0x290
[Tue Nov 19 18:25:28 2024] ? __call_rcu+0xa8/0x270
[Tue Nov 19 18:25:28 2024] ? uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[Tue Nov 19 18:25:28 2024] ? __fput+0x9c/0x280
[Tue Nov 19 18:25:28 2024] ? ____fput+0xe/0x20
[Tue Nov 19 18:25:28 2024] ? task_work_run+0x6a/0xb0
[Tue Nov 19 18:25:28 2024] ? do_exit+0x217/0x3c0
[Tue Nov 19 18:25:28 2024] ? do_group_exit+0x3b/0xb0
[Tue Nov 19 18:25:28 2024] ? get_signal+0x150/0x900
[Tue Nov 19 18:25:28 2024] ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:25:28 2024] ? arch_do_signal_or_restart+0xde/0x100
[Tue Nov 19 18:25:28 2024] ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:25:28 2024] ? exit_to_user_mode_loop+0xc4/0x160
[Tue Nov 19 18:25:28 2024] ? exit_to_user_mode_prepare+0xa0/0xb0
[Tue Nov 19 18:25:28 2024] ? syscall_exit_to_user_mode+0x27/0x50
[Tue Nov 19 18:25:28 2024] ? do_syscall_64+0x69/0xc0
[Tue Nov 19 18:25:28 2024] ? sysvec_apic_timer_interrupt+0x4e/0x90
[Tue Nov 19 18:25:28 2024] ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Nov 19 18:25:28 2024] </TASK>
[Tue Nov 19 18:25:28 2024] INFO: task nvidia-smi:596278 blocked for more than 120 seconds.
[Tue Nov 19 18:25:28 2024] Tainted: P OE 5.15.0-78-generic #85-Ubuntu
[Tue Nov 19 18:25:28 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 19 18:25:28 2024] task:nvidia-smi state:D stack: 0 pid:596278 ppid:416738 flags:0x00000226
[Tue Nov 19 18:25:28 2024] Call Trace:
[Tue Nov 19 18:25:28 2024] <TASK>
[Tue Nov 19 18:25:28 2024] __schedule+0x24e/0x590
[Tue Nov 19 18:25:28 2024] schedule+0x69/0x110
[Tue Nov 19 18:25:28 2024] rwsem_down_write_slowpath+0x230/0x3e0
[Tue Nov 19 18:25:28 2024] down_write+0x47/0x60
[Tue Nov 19 18:25:28 2024] os_acquire_rwlock_write+0x35/0x50 [nvidia]
[Tue Nov 19 18:25:28 2024] _nv044172rm+0x10/0x40 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv045433rm+0x26c/0x2f0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv047281rm+0x54/0xd0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv047220rm+0x91/0x410 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv014159rm+0x3f1/0x690 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv045410rm+0x69/0xd0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv012638rm+0x86/0xa0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? _nv000731rm+0xab2/0xeb0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? rm_ioctl+0x58/0xb0 [nvidia]
[Tue Nov 19 18:25:28 2024] ? nvidia_unlocked_ioctl+0x628/0x930 [nvidia]
[Tue Nov 19 18:25:28 2024] ? __x64_sys_ioctl+0x92/0xd0
[Tue Nov 19 18:25:28 2024] ? do_syscall_64+0x59/0xc0
[Tue Nov 19 18:25:28 2024] ? exit_to_user_mode_prepare+0x37/0xb0
[Tue Nov 19 18:25:28 2024] ? syscall_exit_to_user_mode+0x27/0x50
[Tue Nov 19 18:25:28 2024] ? __do_sys_getpid+0x1e/0x30
[Tue Nov 19 18:25:28 2024] ? do_syscall_64+0x69/0xc0
[Tue Nov 19 18:25:28 2024] ? do_user_addr_fault+0x1e7/0x670
[Tue Nov 19 18:25:28 2024] ? __do_sys_getpid+0x1e/0x30
[Tue Nov 19 18:25:28 2024] ? exit_to_user_mode_prepare+0x37/0xb0
[Tue Nov 19 18:25:28 2024] ? irqentry_exit_to_user_mode+0x9/0x20
[Tue Nov 19 18:25:28 2024] ? irqentry_exit+0x1d/0x30
[Tue Nov 19 18:25:28 2024] ? exc_page_fault+0x89/0x170
[Tue Nov 19 18:25:28 2024] ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Nov 19 18:25:28 2024] </TASK>
[Tue Nov 19 18:27:29 2024] INFO: task nv_queue:28235 blocked for more than 120 seconds.
...
[Tue Nov 19 18:27:29 2024] Tainted: P OE 5.15.0-78-generic #85-Ubuntu
[Tue Nov 19 18:27:29 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Nov 19 18:27:29 2024] task:pt_main_thread state:D stack: 0 pid:594193 ppid:416738 flags:0x00004226
[Tue Nov 19 18:27:29 2024] Call Trace:
[Tue Nov 19 18:27:29 2024] <TASK>
[Tue Nov 19 18:27:29 2024] __schedule+0x24e/0x590
[Tue Nov 19 18:27:29 2024] schedule+0x69/0x110
[Tue Nov 19 18:27:29 2024] rwsem_down_write_slowpath+0x230/0x3e0
[Tue Nov 19 18:27:29 2024] ? try_to_wake_up+0x200/0x5a0
[Tue Nov 19 18:27:29 2024] ? native_queued_spin_lock_slowpath+0x2c/0x40
[Tue Nov 19 18:27:29 2024] down_write+0x47/0x60
[Tue Nov 19 18:27:29 2024] os_acquire_rwlock_write+0x35/0x50 [nvidia]
[Tue Nov 19 18:27:29 2024] _nv044172rm+0x10/0x40 [nvidia]
[Tue Nov 19 18:27:29 2024] ? _nv045433rm+0x26c/0x2f0 [nvidia]
[Tue Nov 19 18:27:29 2024] ? _nv012652rm+0x82/0x1e0 [nvidia]
[Tue Nov 19 18:27:29 2024] ? _nv039216rm+0x67/0x250 [nvidia]
[Tue Nov 19 18:27:29 2024] ? _nv039216rm+0x44/0x250 [nvidia]
[Tue Nov 19 18:27:29 2024] ? rm_gpu_ops_stop_channel+0x23/0x60 [nvidia]
[Tue Nov 19 18:27:29 2024] ? nvUvmGetSafeStack+0x93/0xc0 [nvidia]
[Tue Nov 19 18:27:29 2024] ? nvUvmInterfaceStopChannel+0x29/0x80 [nvidia]
[Tue Nov 19 18:27:29 2024] ? uvm_user_channel_stop+0x48/0x60 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024] ? uvm_va_space_stop_all_user_channels.part.0+0x98/0x130 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024] ? uvm_va_space_destroy+0xc5/0x6d0 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024] ? uvm_release.constprop.0+0xa3/0x130 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024] ? uvm_release_entry.part.0.isra.0+0x80/0xb0 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024] ? security_file_free+0x54/0x60
[Tue Nov 19 18:27:29 2024] ? kmem_cache_free+0x272/0x290
[Tue Nov 19 18:27:29 2024] ? __call_rcu+0xa8/0x270
[Tue Nov 19 18:27:29 2024] ? uvm_release_entry+0x2a/0x30 [nvidia_uvm]
[Tue Nov 19 18:27:29 2024] ? __fput+0x9c/0x280
[Tue Nov 19 18:27:29 2024] ? ____fput+0xe/0x20
[Tue Nov 19 18:27:29 2024] ? task_work_run+0x6a/0xb0
[Tue Nov 19 18:27:29 2024] ? do_exit+0x217/0x3c0
[Tue Nov 19 18:27:29 2024] ? do_group_exit+0x3b/0xb0
[Tue Nov 19 18:27:29 2024] ? get_signal+0x150/0x900
[Tue Nov 19 18:27:29 2024] ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:27:29 2024] ? arch_do_signal_or_restart+0xde/0x100
[Tue Nov 19 18:27:29 2024] ? nvidia_unlocked_ioctl+0x155/0x930 [nvidia]
[Tue Nov 19 18:27:29 2024] ? exit_to_user_mode_loop+0xc4/0x160
[Tue Nov 19 18:27:29 2024] ? exit_to_user_mode_prepare+0xa0/0xb0
[Tue Nov 19 18:27:29 2024] ? syscall_exit_to_user_mode+0x27/0x50
[Tue Nov 19 18:27:29 2024] ? do_syscall_64+0x69/0xc0
[Tue Nov 19 18:27:29 2024] ? sysvec_apic_timer_interrupt+0x4e/0x90
[Tue Nov 19 18:27:29 2024] ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[Tue Nov 19 18:27:29 2024] </TASK>
[Tue Nov 19 18:27:29 2024] INFO: task pt_main_thread:594194 blocked for more than 120 seconds.
No NVRM: Xid errors appear in dmesg, and lspci reports no abnormalities in the GPU's PCIe information. However, the only way to recover the NVIDIA kernel driver is to reboot the workstation.
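For anyone hitting the same symptom: because nvidia-smi blocks in uninterruptible sleep instead of failing with an error, the only reliable detector I have found is a wall-clock timeout. A minimal sketch of the probe I run periodically (the command and the 30-second timeout are just my choices):

```python
import subprocess

def probe_gpu(cmd=("nvidia-smi",), timeout_s=30):
    """Return True if cmd exits before the timeout, False if it hangs.

    A wedged driver leaves nvidia-smi in uninterruptible sleep (state D),
    so it never returns and never errors; only a timeout catches it.
    """
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Likely hung: a good moment to snapshot dmesg before rebooting.
        return False
```

I run this from cron every few minutes and log a timestamp the first time it returns False, which narrows down when the hang actually starts.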
Troubleshooting Steps Taken:
- Driver and Kernel Updates: I have tried various NVIDIA driver versions and kernel versions, but the issue persists.
- GPU Swap Test: I tested a colleague’s RTX 4080S from a different AIC vendor, and the issue did not occur, even after extended usage.
- RMA Process:
- I sent my RTX 4080S for RMA. The vendor reported no issues, and all diagnostics passed successfully.
- After reinstalling the GPU, the same issue reappeared within a day.
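To catch the hang earlier next time it recurs, I have also started polling for tasks stuck in uninterruptible sleep (state D, the same state shown in the traces above). A sketch, nothing NVIDIA-specific, just a standard-library scan of /proc:

```python
from pathlib import Path

def d_state_tasks():
    """List (pid, comm) of tasks in uninterruptible sleep (state D)."""
    hung = []
    for stat in Path("/proc").glob("[0-9]*/stat"):
        try:
            # Format: "pid (comm) state ..."; comm may contain spaces,
            # so split on the last ')' rather than on whitespace.
            head, rest = stat.read_text().rsplit(")", 1)
            state = rest.split()[0]
            if state == "D":
                hung.append((int(stat.parent.name), head.split("(", 1)[1]))
        except (OSError, IndexError, ValueError):
            continue  # task exited mid-scan or stat was malformed
    return hung
```

When pt_main_thread or nvidia-smi shows up here and never leaves, the driver is already wedged and a reboot is due.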
Key Questions:
- Could this call trace indicate a hardware defect, despite the vendor’s diagnostics reporting no issues?
- Is it possible that the issue lies with the GPU’s VBIOS?
This problem has been ongoing for over two months, and I would greatly appreciate your expertise in diagnosing and resolving this issue. Please let me know if additional logs or information are needed.