GPU-related programs often enter the "uninterruptible sleep" state

  • System: Ubuntu 20.04 (Linux version 5.13.0-48-generic)
  • NVIDIA driver version: 510.73.05
  • CUDA version: 11.3
  • PyTorch version: 1.11.0+cu113
  • GPU card: GALAX 3080 12G

Hi! Recently, when I run my PyTorch programs on my RTX 3080 12G, they sometimes fall into “uninterruptible sleep” (state D in htop). When this happens:

  • All GPU-related programs stop responding, including nvidia-smi and nvidia-bug-report.sh (they all fall into the “uninterruptible sleep” state).
  • I can’t reboot the machine with sudo reboot. Only pressing the physical reboot button seems to work.
  • The GPU card’s fans spin very fast.
  • However, the GPU’s temperature seems OK (checked by hand), and after a reboot, nvidia-smi shows a temperature of 40 °C.
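
For anyone who wants to check for the same symptom without htop, here is a minimal sketch of how the D-state processes could be listed by reading /proc directly. It does not touch the NVIDIA driver, so it should still run while nvidia-smi is hung. The function names parse_stat and list_d_state are my own, not from any library, and the sketch assumes a Linux /proc layout.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' instead of splitting on whitespace.
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def list_d_state(proc_root="/proc"):
    """Collect (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in list_d_state():
        print(pid, comm)
```

Running this in a loop (or from cron) and logging the result might help pin down which process blocks first.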

This happens quite frequently: I encounter it every 2 to 3 days. During the latest breakdown, I ran dmesg and it showed the following messages. I’m not sure whether this is a driver issue. Could anyone tell me how I can trace the bug? Thanks!

[28293.201947] Btrfs loaded, crc32c=crc32c-intel, zoned=yes
[61953.689620] mce: [Hardware Error]: Machine check events logged
[61953.689623] [Hardware Error]: Corrected error, no action required.
[61953.689626] [Hardware Error]: CPU:1 (19:21:2) MC8_STATUS[-|CE|-|-|-|-|-|-|-]: 0x8000000100f9a163
[61953.689631] [Hardware Error]: IPID: 0x0000000000000000
[61953.689632] [Hardware Error]: L3 Cache Ext. Error Code: 57
[61953.689633] [Hardware Error]: cache level: L3/GEN, tx: INSN
[102335.965826] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[391501.205407] NVRM: GPU at PCI:0000:0a:00: GPU-6b27e0e6-3381-5be8-750f-de6e7813ab97
[391501.205413] NVRM: Xid (PCI:0000:0a:00): 32, pid=1105, Channel ID 00000000 intr 00008000
[391501.205901] NVRM: Xid (PCI:0000:0a:00): 32, pid=1105, Channel ID 00000000 intr 00008000
[392102.955112] INFO: task python:73719 blocked for more than 120 seconds.
[392102.955119]       Tainted: P           OE     5.13.0-44-generic #49~20.04.1-Ubuntu
[392102.955121] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[392102.955122] task:python          state:D stack:    0 pid:73719 ppid: 72384 flags:0x00000004
[392102.955127] Call Trace:
[392102.955129]  <TASK>
[392102.955132]  __schedule+0x2ee/0x900
[392102.955139]  schedule+0x4f/0xc0
[392102.955141]  schedule_timeout+0x202/0x290
[392102.955144]  __down+0x82/0xd0
[392102.955147]  down+0x47/0x60
[392102.955150]  nvidia_ioctl+0xb5/0x8f0 [nvidia]
[392102.955384]  ? __x64_sys_futex+0x7b/0x1b0
[392102.955388]  nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[392102.955622]  __x64_sys_ioctl+0x91/0xc0
[392102.955626]  do_syscall_64+0x61/0xb0
[392102.955629]  ? do_syscall_64+0x6e/0xb0
[392102.955630]  ? do_syscall_64+0x6e/0xb0
[392102.955632]  ? do_syscall_64+0x6e/0xb0
[392102.955633]  ? syscall_exit_to_user_mode+0x27/0x50
[392102.955636]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[392102.955639] RIP: 0033:0x7f2d303d53ab
[392102.955642] RSP: 002b:00007f2c5cfceb28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[392102.955645] RAX: ffffffffffffffda RBX: 00007f2c5cfcebc0 RCX: 00007f2d303d53ab
[392102.955646] RDX: 00007f2c5cfcebc0 RSI: 00000000c020462a RDI: 0000000000000003
[392102.955648] RBP: 00000000c020462a R08: 00007f2c5cfcebc0 R09: 00007f2c5cfcebdc
[392102.955649] R10: 00007ffd7b3ee1b0 R11: 0000000000000246 R12: 0000000000000003
[392102.955650] R13: 00007f2c5cfcebdc R14: 0000000062a6a9a3 R15: 00007f2c5cfceb30
[392102.955652]  </TASK>
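
Since the hang-task trace is preceded by NVRM Xid lines, a small script to pull those out of a saved dmesg dump may make it easier to compare breakdowns. The following is only a sketch: the regular expression is written against the two Xid lines shown above and may not match every Xid message format the driver emits, and find_xid_events is a name I made up for illustration.

```python
import re
import sys

# Pattern modelled on lines such as:
#   NVRM: Xid (PCI:0000:0a:00): 32, pid=1105, Channel ID 00000000 intr 00008000
# The pid field is treated as optional because it may be absent or non-numeric.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?"
)


def find_xid_events(dmesg_text):
    """Return a list of (pci_address, xid_code, pid_or_None) tuples."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m is None:
            continue
        pid = int(m.group("pid")) if m.group("pid") else None
        events.append((m.group("pci"), int(m.group("xid")), pid))
    return events


if __name__ == "__main__":
    # Read a saved dmesg dump from standard input.
    for pci, xid, pid in find_xid_events(sys.stdin.read()):
        print(f"PCI {pci}: Xid {xid} (pid {pid})")
```

It can be fed the attached file, for example `python3 xid_scan.py < dmesg.txt` (xid_scan.py is just a placeholder file name). The resulting Xid codes can then be looked up in NVIDIA's Xid error documentation.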

I have uploaded the complete dmesg output captured during a breakdown. When I ran nvidia-bug-report.sh during the breakdown, it hung and generated nothing, so the attached log is the output of nvidia-bug-report.sh run immediately after the reboot.

nvidia-bug-report.log.gz (316.3 KB)
dmesg.txt (69.3 KB)