The server has 8x GTX 1080 Ti GPUs, with CUDA 11.4 and driver version 470.57.02 installed.
The PyTorch version is 1.7.1+cu101, installed with pip.
I understand that the CUDA version PyTorch was built against (10.1) does not match the installed CUDA Toolkit (11.4), but a mismatch like that should at worst crash the application, not cause a kernel panic. This looks like a bug in the NVIDIA kernel module.
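For reference, here is a minimal sketch of how the mismatch can be read off the version strings. The helper names are hypothetical, and the strings are hard-coded from this report rather than queried from a live system. On a real machine the first value would come from `torch.__version__`. The sketch assumes the `+cuNNN` tag encodes the minor version in its last digit.

```python
import re


def cuda_build_from_torch_version(version):
    """Return (major, minor) of the CUDA build encoded in a PyTorch wheel tag.

    Assumes the local version tag looks like '+cu101' or '+cu92', where the
    last digit is the minor version. Returns None if no such tag is present.
    """
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None
    digits = match.group(1)
    return int(digits[:-1]), int(digits[-1])


def parse_toolkit_version(text):
    """Parse a toolkit version such as '11.4' into (major, minor)."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


# Values taken from this report (hard-coded for illustration).
torch_build = cuda_build_from_torch_version("1.7.1+cu101")
toolkit = parse_toolkit_version("11.4")

if torch_build != toolkit:
    print("PyTorch CUDA build", torch_build, "differs from toolkit", toolkit)
```
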
Logs are pasted below.
Crash Log
The following crash log was exported with kdump and extracted with the crash command-line utility.
KERNEL: vmlinux-5.4.0-81-generic
DUMPFILE: dump.202109071343 [PARTIAL DUMP]
CPUS: 40
DATE: Tue Sep 7 21:42:22 2021
UPTIME: 01:30:34
LOAD AVERAGE: 6.78, 6.05, 4.70
TASKS: 1093
RELEASE: 5.4.0-81-generic
VERSION: #91-Ubuntu SMP Thu Jul 15 19:09:17 UTC 2021
MACHINE: x86_64 (2400 Mhz)
MEMORY: 191.9 GB
PANIC: "Oops: 0000 [#1] SMP PTI" (check log for details)
PID: 608479
COMMAND: "python"
TASK: ffff8e1837d91740 [THREAD_INFO: ffff8e1837d91740]
CPU: 12
STATE: TASK_RUNNING (PANIC)
PID: 608479 TASK: ffff8e1837d91740 CPU: 12 COMMAND: "python"
#0 [ffffb3ca29e3b758] machine_kexec at ffffffffaf66b7c3
#1 [ffffb3ca29e3b7b8] __crash_kexec at ffffffffaf749822
#2 [ffffb3ca29e3b888] crash_kexec at ffffffffaf74a5a9
#3 [ffffb3ca29e3b8a0] oops_end at ffffffffaf6344a9
#4 [ffffb3ca29e3b8c8] no_context at ffffffffaf67a19e
#5 [ffffb3ca29e3b938] __bad_area_nosemaphore at ffffffffaf67a3b0
#6 [ffffb3ca29e3b980] bad_area_nosemaphore at ffffffffaf67a516
#7 [ffffb3ca29e3b990] do_user_addr_fault at ffffffffaf67aa37
#8 [ffffb3ca29e3b9f8] __do_page_fault at ffffffffaf67af58
#9 [ffffb3ca29e3ba20] do_page_fault at ffffffffaf67afbc
#10 [ffffb3ca29e3ba50] page_fault at ffffffffb0201284
[exception RIP: _nv029462rm+1070]
RIP: ffffffffc1110f7e RSP: ffffb3ca29e3bb00 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8e16e51e4008 RCX: 0000000000000001
RDX: ffff8e16e51e4008 RSI: 0000000000000000 RDI: ffff8e16da5a0008
RBP: ffff8e172bd62d30 R8: 0000000000000001 R9: ffffffffc0ce4c00
R10: ffff8e16e51e0000 R11: 0000000000000001 R12: ffff8e16da5a0008
R13: ffff8e16da5a0008 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#11 [ffffb3ca29e3bb28] _nv029436rm at ffffffffc0d2aa59 [nvidia]
#12 [ffffb3ca29e3bb58] _nv002278rm at ffffffffc14476a9 [nvidia]
#13 [ffffb3ca29e3bb68] _nv003733rm at ffffffffc1442f7b [nvidia]
#14 [ffffb3ca29e3bb88] _nv014655rm at ffffffffc143edd6 [nvidia]
#15 [ffffb3ca29e3bbb8] _nv037695rm at ffffffffc143d313 [nvidia]
#16 [ffffb3ca29e3bbe8] _nv037694rm at ffffffffc143d647 [nvidia]
#17 [ffffb3ca29e3bc18] _nv037689rm at ffffffffc143d9e0 [nvidia]
#18 [ffffb3ca29e3bc38] _nv037690rm at ffffffffc143db0b [nvidia]
#19 [ffffb3ca29e3bc68] _nv036056rm at ffffffffc0d5bd10 [nvidia]
#20 [ffffb3ca29e3bc88] _nv000699rm at ffffffffc167b4c8 [nvidia]
#21 [ffffb3ca29e3bca8] rm_cleanup_file_private at ffffffffc167c58a [nvidia]
#22 [ffffb3ca29e3bd78] nvidia_close at ffffffffc0cda9e9 [nvidia]
#23 [ffffb3ca29e3bde0] __fput at ffffffffaf8cc63c
#24 [ffffb3ca29e3be30] ____fput at ffffffffaf8cc83e
#25 [ffffb3ca29e3be40] task_work_run at ffffffffaf6bdb0f
#26 [ffffb3ca29e3be78] do_exit at ffffffffaf69e31e
#27 [ffffb3ca29e3bef0] do_group_exit at ffffffffaf69eb47
#28 [ffffb3ca29e3bf20] __x64_sys_exit_group at ffffffffaf69ebc8
#29 [ffffb3ca29e3bf30] do_syscall_64 at ffffffffaf603fd7
#30 [ffffb3ca29e3bf50] entry_SYSCALL_64_after_hwframe at ffffffffb020008c
RIP: 00007fac464532c6 RSP: 00007ffd4eab0478 RFLAGS: 00000213
RAX: ffffffffffffffda RBX: 000055e76c3f3f90 RCX: 00007fac464532c6
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00007fac45fe2360 R8: 00000000000000e7 R9: ffffffffffffff80
R10: 00000000000000a1 R11: 0000000000000213 R12: 8000000000000001
R13: 00007faa92d321f0 R14: 00007faa92d32040 R15: 00007faa92d321e8
ORIG_RAX: 00000000000000e7 CS: 0033 SS: 002b
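The backtrace above shows the fault happening inside the nvidia module while the process was exiting (do_exit, task_work_run, __fput, nvidia_close, then the `_nv...rm` frames). As a rough aid for reading such traces, the sketch below picks out the frames tagged with a module name. The regular expression and function name are my own assumptions about the frame layout shown above, not part of the crash utility, and the sample is a few frames copied from this log.

```python
import re

# Matches frame lines such as:
#   #22 [ffffb3ca29e3bd78] nvidia_close at ffffffffc0cda9e9 [nvidia]
# The trailing "[module]" tag is optional.
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+\[(?P<stack>[0-9a-f]+)\]\s+(?P<symbol>\S+)"
    r"\s+at\s+(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return a list of dicts, one per backtrace frame found in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "index": int(match.group("index")),
                "symbol": match.group("symbol"),
                "module": match.group("module"),  # None if the line has no tag
            })
    return frames


# A few frames copied from the log above.
SAMPLE = """\
#20 [ffffb3ca29e3bc88] _nv000699rm at ffffffffc167b4c8 [nvidia]
#21 [ffffb3ca29e3bca8] rm_cleanup_file_private at ffffffffc167c58a [nvidia]
#22 [ffffb3ca29e3bd78] nvidia_close at ffffffffc0cda9e9 [nvidia]
#23 [ffffb3ca29e3bde0] __fput at ffffffffaf8cc63c
#26 [ffffb3ca29e3be78] do_exit at ffffffffaf69e31e
"""

frames = parse_frames(SAMPLE)
nvidia_symbols = [f["symbol"] for f in frames if f["module"] == "nvidia"]
print(nvidia_symbols)
```

Running the same function over the full backtrace would list every frame attributed to the nvidia module, which is the part relevant to this report.
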
Kernel module info
filename: /lib/modules/5.4.0-81-generic/kernel/drivers/video/nvidia.ko
firmware: nvidia/470.57.02/gsp.bin
alias: char-major-195-*
version: 470.57.02
supported: external
license: NVIDIA
srcversion: 00F9E8DEACC0FB98727C03C
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: drm
retpoline: Y
name: nvidia
vermagic: 5.4.0-81-generic SMP mod_unload modversions
parm: NvSwitchRegDwords:NvSwitch regkey (charp)
parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_EnablePCIeGen3:int
parm: NVreg_EnableMSI:int
parm: NVreg_TCEBypassMode:int
parm: NVreg_EnableStreamMemOPs:int
parm: NVreg_RestrictProfilingToAdminUsers:int
parm: NVreg_PreserveVideoMemoryAllocations:int
parm: NVreg_EnableS0ixPowerManagement:int
parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
parm: NVreg_DynamicPowerManagement:int
parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
parm: NVreg_EnableGpuFirmware:int
parm: NVreg_EnableUserNUMAManagement:int
parm: NVreg_MemoryPoolSize:int
parm: NVreg_KMallocHeapMaxSize:int
parm: NVreg_VMallocHeapMaxSize:int
parm: NVreg_IgnoreMMIOCheck:int
parm: NVreg_NvLinkDisable:int
parm: NVreg_EnablePCIERelaxedOrderingMode:int
parm: NVreg_RegisterPCIDriver:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RegistryDwordsPerDevice:charp
parm: NVreg_RmMsg:charp
parm: NVreg_GpuBlacklist:charp
parm: NVreg_TemporaryFilePath:charp
parm: NVreg_ExcludedGpus:charp
parm: rm_firmware_active:charp