RTX 2080 card crashes during longer PyTorch training runs

Hi there,

My OS is Ubuntu 18.04 and I have 4 RTX 2080 cards in the machine. Smaller training runs worked fine, but the very first time I started a longer training run, I got this:

nvidia-smi
Unable to determine the device handle for GPU 0000:19:00.0: GPU is lost. Reboot the system to recover this GPU

sudo dmesg | tail -n 60
[76298.527993] wlx7cdd903371a2: send auth to 70:3a:0e:54:49:60 (try 1/3)
[76298.529737] wlx7cdd903371a2: authenticated
[76298.533225] wlx7cdd903371a2: associate with 70:3a:0e:54:49:60 (try 1/3)
[76298.536445] wlx7cdd903371a2: RX AssocResp from 70:3a:0e:54:49:60 (capab=0x431 status=0 aid=1)
[76298.539510] wlx7cdd903371a2: associated
[76298.547546] IPv6: ADDRCONF(NETDEV_CHANGE): wlx7cdd903371a2: link becomes ready
[76617.186011] logitech-hidpp-device 0003:046D:200A.0006: unable to retrieve the name of the device
[76771.242742] NVRM: GPU at PCI:0000:19:00: GPU-63b97003-35a2-c293-323b-8c2342e7bd46
[76771.242747] NVRM: GPU Board Serial Number:
[76771.242751] NVRM: Xid (PCI:0000:19:00): 79, GPU has fallen off the bus.
[76771.242756] NVRM: GPU at 00000000:19:00.0 has fallen off the bus.
[76771.242757] NVRM: GPU is on Board .
[76771.242769] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[79799.361585] logitech-hidpp-device 0003:046D:200A.0006: unable to retrieve the name of the device
[79952.669854] WARNING: CPU: 5 PID: 2407 at /tmp/selfgz3707/NVIDIA-Linux-x86_64-418.39/kernel/nvidia/nv.c:5123 nvidia_dev_put_uuid+0x49/0x50 [nvidia]
[79952.669854] Modules linked in: nvidia_uvm(OE) tcp_diag inet_diag xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c br_netfilter bridge stp llc ccm aufs overlay snd_hda_codec_hdmi nls_iso8859_1 nvidia_drm(POE) nvidia_modeset(POE) arc4 nvidia(POE) snd_seq_midi snd_seq_midi_event intel_rapl rt2800usb x86_pkg_temp_thermal intel_powerclamp coretemp rt2x00usb eeepc_wmi snd_hda_codec_realtek rt2800lib snd_hda_codec_generic asus_wmi snd_rawmidi rt2x00lib sparse_keymap snd_hda_intel mac80211 mxm_wmi video kvm_intel intel_wmi_thunderbolt wmi_bmof cfg80211 snd_hda_codec kvm drm_kms_helper snd_hda_core irqbypass drm crct10dif_pclmul joydev snd_seq
[79952.669879] crc32_pclmul snd_hwdep input_leds ipmi_devintf ghash_clmulni_intel pcbc snd_seq_device ipmi_msghandler aesni_intel fb_sys_fops snd_pcm aes_x86_64 syscopyarea crypto_simd snd_timer glue_helper sysfillrect sysimgblt cryptd intel_cstate snd intel_rapl_perf soundcore mei_me mei ioatdma mac_hid shpchp wmi sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_logitech_hidpp hid_generic hid_logitech_dj usbhid hid igb i2c_algo_bit e1000e dca ptp ahci pps_core libahci
[79952.669901] CPU: 5 PID: 2407 Comm: ZMQbg/1 Tainted: P OE 4.15.0-48-generic #51-Ubuntu
[79952.669901] Hardware name: System manufacturer System Product Name/WS X299 SAGE, BIOS 0905 11/30/2018
[79952.669979] RIP: 0010:nvidia_dev_put_uuid+0x49/0x50 [nvidia]
[79952.669980] RSP: 0018:ffffafbda4b5fb68 EFLAGS: 00010202
[79952.669980] RAX: 0000000000000026 RBX: ffff928194e40800 RCX: ffffafbda4b5fb08
[79952.669981] RDX: 0000000000000087 RSI: 0000000000000246 RDI: 0000000000000246
[79952.669981] RBP: ffffafbda4b5fb78 R08: ffffcfbd7f960310 R09: 0000000000000000
[79952.669982] R10: ffffffffc3121bc0 R11: 0000000000000400 R12: ffff9271a7143000
[79952.669982] R13: ffffffffc09dbfe0 R14: ffffafbd8680e5c0 R15: 0000000000000000
[79952.669983] FS: 00007fba1d140700(0000) GS:ffff92819f540000(0000) knlGS:0000000000000000
[79952.669984] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[79952.669984] CR2: 00007f607f8bd100 CR3: 0000000aafa0a001 CR4: 00000000003606e0
[79952.669985] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[79952.669985] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[79952.669986] Call Trace:
[79952.670042] nvUvmInterfaceUnregisterGpu+0x22/0x30 [nvidia]
[79952.670053] remove_gpu+0x28b/0x2e0 [nvidia_uvm]
[79952.670057] uvm_gpu_release_locked+0x17/0x20 [nvidia_uvm]
[79952.670062] uvm_va_space_destroy+0x3f7/0x490 [nvidia_uvm]
[79952.670066] uvm_release+0x11/0x20 [nvidia_uvm]
[79952.670068] __fput+0xea/0x220
[79952.670069] ____fput+0xe/0x10
[79952.670071] task_work_run+0x9d/0xc0
[79952.670073] do_exit+0x2ec/0xb40
[79952.670074] do_group_exit+0x43/0xb0
[79952.670075] get_signal+0x27b/0x590
[79952.670093] do_signal+0x37/0x730
[79952.670095] ? compat_poll_select_copy_remaining+0x130/0x130
[79952.670096] ? compat_poll_select_copy_remaining+0x130/0x130
[79952.670098] exit_to_usermode_loop+0x73/0xd0
[79952.670099] do_syscall_64+0x115/0x130
[79952.670101] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[79952.670103] RIP: 0033:0x7fbe71f15bf9
[79952.670103] RSP: 002b:00007fba1d13fda0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
[79952.670104] RAX: fffffffffffffdfc RBX: 00007fba10000bd0 RCX: 00007fbe71f15bf9
[79952.670104] RDX: 0000000000000064 RSI: 000000000000000b RDI: 00007fba10000bd0
[79952.670105] RBP: 000000000000000b R08: 0000000000000000 R09: 00007fba1d140700
[79952.670105] R10: 00007fba1d13fdb0 R11: 0000000000000293 R12: 0000000000000064
[79952.670106] R13: 0000557a696a0400 R14: 00005579bed2ad78 R15: 000000000000000b
[79952.670106] Code: c7 4c 89 e6 e8 69 b7 ff ff 31 d2 48 89 de 4c 89 e7 e8 3c c6 6d 00 85 c0 75 11 48 8d bb 58 04 00 00 e8 fc 3d 34 ca 5b 41 5c 5d c3 <0f> 0b eb eb 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53
[79952.670123] ---[ end trace b855b128c44ee309 ]---

The full log is attached.
Thanks!
Bao
nvidia-bug-report.log.gz (1.81 MB)

Xid 79 usually points to insufficient power delivery or overheating.
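If it helps: a quick way to tell which of the two it is on your box is to log temperature and power draw alongside the training run and see what the values look like right before the card falls off the bus. A minimal sketch using `nvidia-smi`'s CSV query mode (the helper names and the idea of polling in a loop are mine, not from this thread):

```python
import subprocess
import time

def parse_gpu_stats(csv_line):
    """Parse one CSV line like '83, 215.31 W' into (temp_c, power_w)."""
    temp_s, power_s = csv_line.split(",")
    return int(temp_s.strip()), float(power_s.strip().split()[0])

def read_gpu_stats():
    """Query temperature (C) and power draw (W) for every GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,power.draw",
         "--format=csv,noheader"],
        text=True)
    return [parse_gpu_stats(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    # Poll every 5 seconds; redirect stdout to a file and diff it
    # against the dmesg timestamp of the next Xid 79.
    while True:
        print(time.strftime("%H:%M:%S"), read_gpu_stats(), flush=True)
        time.sleep(5)
```

If one GPU is pinned near its temperature limit, or the combined power draw is close to what your PSU can deliver on those rails, that narrows it down.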

Thanks. What's the temperature threshold for the RTX 2080?

Around 100 °C.
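For the exact values on your specific cards, `nvidia-smi -q -d TEMPERATURE` reports the per-GPU "GPU Slowdown Temp" and "GPU Shutdown Temp" thresholds directly. A small sketch that pulls those numbers out of that report (the sample text below is illustrative, not from the poster's machine):

```python
import re

def parse_temp_thresholds(report):
    """Extract named thresholds from `nvidia-smi -q -d TEMPERATURE` output."""
    thresholds = {}
    for name in ("GPU Shutdown Temp", "GPU Slowdown Temp",
                 "GPU Max Operating Temp"):
        m = re.search(rf"{name}\s*:\s*(\d+) C", report)
        if m:
            thresholds[name] = int(m.group(1))
    return thresholds

# Illustrative sample of the report format (values are made up):
sample = """\
    Temperature
        GPU Current Temp                  : 45 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
"""
```

The card starts thermal-throttling at the slowdown temperature, well before the shutdown value, so sustained readings near the slowdown threshold during training are already a sign of a cooling problem.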

Appreciate it!