- System: Ubuntu 20.04 (Linux version 5.13.0-48-generic)
- NVIDIA driver version: 510.73.05
- CUDA version: 11.3
- PyTorch version: 1.11.0+cu113
- GPU card: GALAX 3080 12G
Hi! Recently, when I run my PyTorch programs on my RTX 3080 12G, they sometimes fall into the “uninterruptible sleep” state (state D in htop). When it happens:
- All GPU-related programs stop responding, including nvidia-smi and nvidia-bug-report.sh (they all fall into the “uninterruptible sleep” state).
- I’m not able to reboot the machine via sudo reboot. It seems only pressing the reboot button works.
- The fans of the GPU card spin very fast.
- However, the GPU’s temperature seems OK (tested with my hand), and after a reboot, nvidia-smi shows a temperature of 40C.
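For anyone who wants to confirm which processes are stuck without relying on htop, here is a minimal sketch, assuming a Linux /proc filesystem. It reads only /proc/<pid>/stat and does not call any NVIDIA tool. The function names parse_stat_line and list_d_state_processes are hypothetical helpers written for this illustration.

```python
import os


def parse_stat_line(stat_line):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat.

    The comm field is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the last ')' rather than on whitespace.
    """
    pid_part, rest = stat_line.split("(", 1)
    comm, after = rest.rsplit(")", 1)
    state = after.split()[0]  # first field after comm is the state letter
    return int(pid_part), comm, state


def list_d_state_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or the file was unreadable
        if state == "D":
            stuck.append((pid, comm))
    return stuck


# Example use:
#     for pid, comm in list_d_state_processes():
#         print(pid, comm)
```

Running this during a hang should list the blocked python process along with any nvidia-smi or nvidia-bug-report.sh invocations that got stuck afterwards.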
This happens quite frequently: I encounter it every 2~3 days. During the latest breakdown, I ran dmesg and it showed the following messages. I’m not sure whether it is a driver issue. Could anyone tell me how I can trace the bug? Thanks!
[28293.201947] Btrfs loaded, crc32c=crc32c-intel, zoned=yes
[61953.689620] mce: [Hardware Error]: Machine check events logged
[61953.689623] [Hardware Error]: Corrected error, no action required.
[61953.689626] [Hardware Error]: CPU:1 (19:21:2) MC8_STATUS[-|CE|-|-|-|-|-|-|-]: 0x8000000100f9a163
[61953.689631] [Hardware Error]: IPID: 0x0000000000000000
[61953.689632] [Hardware Error]: L3 Cache Ext. Error Code: 57
[61953.689633] [Hardware Error]: cache level: L3/GEN, tx: INSN
[102335.965826] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[391501.205407] NVRM: GPU at PCI:0000:0a:00: GPU-6b27e0e6-3381-5be8-750f-de6e7813ab97
[391501.205413] NVRM: Xid (PCI:0000:0a:00): 32, pid=1105, Channel ID 00000000 intr 00008000
[391501.205901] NVRM: Xid (PCI:0000:0a:00): 32, pid=1105, Channel ID 00000000 intr 00008000
[392102.955112] INFO: task python:73719 blocked for more than 120 seconds.
[392102.955119] Tainted: P OE 5.13.0-44-generic #49~20.04.1-Ubuntu
[392102.955121] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[392102.955122] task:python state:D stack: 0 pid:73719 ppid: 72384 flags:0x00000004
[392102.955127] Call Trace:
[392102.955129] <TASK>
[392102.955132] __schedule+0x2ee/0x900
[392102.955139] schedule+0x4f/0xc0
[392102.955141] schedule_timeout+0x202/0x290
[392102.955144] __down+0x82/0xd0
[392102.955147] down+0x47/0x60
[392102.955150] nvidia_ioctl+0xb5/0x8f0 [nvidia]
[392102.955384] ? __x64_sys_futex+0x7b/0x1b0
[392102.955388] nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[392102.955622] __x64_sys_ioctl+0x91/0xc0
[392102.955626] do_syscall_64+0x61/0xb0
[392102.955629] ? do_syscall_64+0x6e/0xb0
[392102.955630] ? do_syscall_64+0x6e/0xb0
[392102.955632] ? do_syscall_64+0x6e/0xb0
[392102.955633] ? syscall_exit_to_user_mode+0x27/0x50
[392102.955636] entry_SYSCALL_64_after_hwframe+0x44/0xae
[392102.955639] RIP: 0033:0x7f2d303d53ab
[392102.955642] RSP: 002b:00007f2c5cfceb28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[392102.955645] RAX: ffffffffffffffda RBX: 00007f2c5cfcebc0 RCX: 00007f2d303d53ab
[392102.955646] RDX: 00007f2c5cfcebc0 RSI: 00000000c020462a RDI: 0000000000000003
[392102.955648] RBP: 00000000c020462a R08: 00007f2c5cfcebc0 R09: 00007f2c5cfcebdc
[392102.955649] R10: 00007ffd7b3ee1b0 R11: 0000000000000246 R12: 0000000000000003
[392102.955650] R13: 00007f2c5cfcebdc R14: 0000000062a6a9a3 R15: 00007f2c5cfceb30
[392102.955652] </TASK>
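The Xid lines in the log above can be pulled out of a saved dmesg dump for comparison across breakdowns. This is a minimal sketch, assuming the lines follow exactly the format shown above. The name parse_xid_events and the dictionary keys are hypothetical and are not part of any NVIDIA tool. The pid is kept as a string on the assumption that it may not always be numeric.

```python
import re

# Matches NVRM Xid lines in the format seen in the log above, e.g.
# "[391501.205413] NVRM: Xid (PCI:0000:0a:00): 32, pid=1105, Channel ID ..."
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), pid=(?P<pid>[^,]+)"
)


def parse_xid_events(dmesg_text):
    """Return one dict per Xid line found in the given dmesg text."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append({
                "timestamp": float(m.group("ts")),  # seconds since boot
                "pci": m.group("pci"),
                "xid": int(m.group("xid")),
                "pid": m.group("pid").strip(),
            })
    return events
```

Feeding it the text of a saved dmesg file from each breakdown and comparing the timestamps against the first "blocked for more than 120 seconds" message would show whether an Xid event precedes every hang.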
I uploaded the complete output of dmesg captured during a breakdown. When I ran nvidia-bug-report.sh during the breakdown, it hung and generated nothing. The attached file is the output of nvidia-bug-report.sh run immediately after the reboot.
nvidia-bug-report.log.gz (316.3 KB)
dmesg.txt (69.3 KB)