GPU stuck during deep learning training

Environment:

OS: Ubuntu 16.04.1, kernel: 4.15.0-70-generic

Hardware: GTX 1080 Ti

CUDA: 10.1

Problem:

The GPU hangs during deep learning training, and the nvidia-smi command also hangs; it seems the NVIDIA driver crashed. This has happened twice in the last two weeks. The last time, I rebooted the machine and reinstalled the NVIDIA driver, which made it work normally again, but after only a few days the same problem occurred. The kernel log also reports a BUG and an Oops.

Attached are the logs from the last two occurrences.

any suggestions?
thanks!
nvidia-bug-report.log (459 KB)
log.zip (158 KB)

Hi,

Could you please share a sample repro script so we can help better?
Meanwhile, can you try updating to the latest CUDA/cuDNN version?

Thanks
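
When filing or updating a report like this, it helps to include the exact kernel, CUDA, cuDNN, and driver versions. A minimal sketch for collecting them is below; the cuDNN header path assumes a default install under /usr/local/cuda, so adjust it if your layout differs.

```shell
# Sketch: collect the version info that is useful in a driver bug report.
uname -r                                    # running kernel
command -v nvcc >/dev/null 2>&1 && nvcc --version | grep release \
    || echo "nvcc not on PATH"
# cuDNN 7.x exposes its version as defines in cudnn.h (assumed default location)
[ -f /usr/local/cuda/include/cudnn.h ] \
    && grep -m 3 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h \
    || echo "cudnn.h not found"
command -v nvidia-smi >/dev/null 2>&1 \
    && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
    || echo "nvidia-smi not available"
```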

I was training the model as usual when the GPU hung and all GPU threads died.

xxxxx@WX-58:~/chenb$ sudo fuser -v /dev/nvidia*
[sudo] password for xxxxx:
                     USER        PID ACCESS COMMAND
/dev/nvidiactl:      xxxxx     31329 F.... python3
                     xxxxx     31421 F.... nvidia-smi
                     xxxxx     31556 F.... nvidia-smi
                     xxxxx     32353 F.... nvidia-smi
                     xxxxx     34822 F.... conda
                     xxxxx     34979 F.... conda
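
Since nvidia-smi itself hangs once the driver has crashed, a small watchdog can detect the hang instead of blocking forever. A sketch using the coreutils `timeout` command (the 10-second limit is an arbitrary choice):

```shell
# Sketch: probe whether the driver still responds, without blocking forever.
# A hung nvidia-smi is the symptom reported above; `timeout` kills it after 10s.
if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not installed"
elif timeout 10 nvidia-smi >/dev/null 2>&1; then
    echo "driver responsive"
else
    echo "nvidia-smi hung or failed -- check dmesg for an Oops"
fi
```

Running this periodically (e.g. from cron) would record roughly when the driver died, which can then be correlated with the kernel log timestamps.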

The attached logs were captured with sudo nvidia-bug-report.sh; the kernel message log contains a kernel BUG, as follows:

[437607.051444] BUG: unable to handle kernel paging request at 0000000000001064
[437607.051824] IP: _nv027168rm+0x285/0x430 [nvidia]
[437607.051826] PGD 0 P4D 0 
[437607.051830] Oops: 0000 [#1] SMP PTI
[437607.051833] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) ipmi_devintf vfio_iommu_type1 vfio xt_nat xt_tcpudp veth xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c ip_tables x_tables br_netfilter bridge stp llc aufs overlay snd_hda_codec_hdmi ipmi_ssif snd_hda_intel snd_hda_codec intel_rapl sb_edac snd_hda_core x86_pkg_temp_thermal snd_hwdep intel_powerclamp coretemp kvm_intel snd_pcm kvm snd_seq_midi snd_seq_midi_event irqbypass snd_rawmidi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_seq snd_seq_device snd_timer pcbc aesni_intel snd aes_x86_64 crypto_simd
[437607.051887]  glue_helper soundcore dcdbas cryptd intel_cstate intel_rapl_perf ipmi_si mei_me mei lpc_ich ipmi_msghandler shpchp acpi_power_meter mac_hid parport_pc ppdev lp parport autofs4 mxm_wmi mgag200 ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb dca drm i2c_algo_bit ptp pps_core ahci libahci megaraid_sas wmi [last unloaded: ipmi_devintf]
[437607.051919] CPU: 38 PID: 30021 Comm: python3 Tainted: P           OE    4.15.0-70-generic #79~16.04.1-Ubuntu
[437607.051921] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.4.2 01/09/2017
[437607.052243] RIP: 0010:_nv027168rm+0x285/0x430 [nvidia]
[437607.052245] RSP: 0018:ffffa7250813f958 EFLAGS: 00010292
[437607.052247] RAX: 0000000000000000 RBX: ffff96e47e750008 RCX: 0000000000000020
[437607.052248] RDX: 0000000000000001 RSI: ffff96f2a2e65dcc RDI: 0000000000000001
[437607.052250] RBP: ffff96f2a2e65dd8 R08: 0000000000000020 R09: ffff96f2a2e65dc0
[437607.052251] R10: ffffffffc0e31830 R11: 0000000000000000 R12: 0000000000000001
[437607.052253] R13: ffff96f3162b8008 R14: 0000000000000000 R15: 000000005c000001
[437607.052255] FS:  00007f2c01a67740(0000) GS:ffff96e33fcc0000(0000) knlGS:0000000000000000
[437607.052256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[437607.052258] CR2: 0000000000001064 CR3: 000000100120a003 CR4: 00000000003606e0
[437607.052260] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[437607.052261] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[437607.052262] Call Trace:
[437607.052482]  ? _nv027256rm+0x1a3/0x840 [nvidia]
[437607.052696]  ? _nv003594rm+0x9/0x20 [nvidia]
[437607.052911]  ? _nv004345rm+0x1b/0x80 [nvidia]
[437607.053127]  ? _nv011188rm+0x2d7/0x350 [nvidia]
[437607.053343]  ? _nv035367rm+0x89/0x120 [nvidia]
[437607.053558]  ? _nv035366rm+0x250/0x500 [nvidia]
[437607.053774]  ? _nv035363rm+0x56/0x70 [nvidia]
[437607.053989]  ? _nv035364rm+0xad/0xd0 [nvidia]
[437607.054211]  ? _nv034155rm+0xcd/0x160 [nvidia]
[437607.054413]  ? _nv000883rm+0x67/0xa0 [nvidia]
[437607.054614]  ? rm_free_unused_clients+0xcb/0xe0 [nvidia]
[437607.054732]  ? nvidia_close+0x15a/0x2e0 [nvidia]
[437607.054849]  ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[437607.054855]  ? __fput+0xea/0x220
[437607.054859]  ? ____fput+0xe/0x10
[437607.054864]  ? task_work_run+0x8a/0xb0
[437607.054868]  ? do_exit+0x2e9/0xbd0
[437607.054872]  ? do_group_exit+0x43/0xb0
[437607.054875]  ? get_signal+0x169/0x820
[437607.054881]  ? do_signal+0x37/0x730
[437607.054886]  ? do_futex+0x129/0x590
[437607.054893]  ? exit_to_usermode_loop+0x80/0xd0
[437607.054896]  ? do_syscall_64+0x100/0x130
[437607.054903]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[437607.054905] Code: 11 01 00 00 c7 45 08 00 00 00 00 41 8b bd 74 10 00 00 83 ff 20 0f 84 61 01 00 00 e8 96 e1 bb ff 4c 89 ef 41 89 c6 e8 3b 07 bc ff <8b> 88 64 10 00 00 44 89 e0 4c 89 ef d3 e0 f7 d0 41 21 c6 e8 13 
[437607.055258] RIP: _nv027168rm+0x285/0x430 [nvidia] RSP: ffffa7250813f958
[437607.055260] CR2: 0000000000001064
[437607.055263] ---[ end trace 555ade73743b88c2 ]---
[437607.119214] Fixing recursive fault but reboot is needed!

The latest CUDA version is 10.1, and that is already the version I am running. Do you think this is a kernel bug or a CUDA bug?

This does not happen in the middle of training, but right when training starts; it is not a problem during use, it is a problem at startup.
Thanks

Hi,

Can you try upgrading to the latest CUDA/cuDNN version?
Also, run the following commands to get the latest kernel:
sudo apt-get update
sudo apt-get dist-upgrade

And then reboot the machine.
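
After the dist-upgrade and reboot, something like the following can confirm that the kernel actually changed and that the NVIDIA module was rebuilt for it. This sketch assumes the driver was installed through DKMS-based Ubuntu packages; if it was installed from the .run installer, the dkms check will not apply.

```shell
# Sketch: confirm versions after `apt-get dist-upgrade` and a reboot.
uname -r                          # should now show the newer kernel
command -v nvidia-smi >/dev/null 2>&1 \
    && nvidia-smi | head -n 3 \
    || echo "nvidia-smi not available (driver may need reinstalling)"
# DKMS status shows whether the NVIDIA module was rebuilt for the new kernel
command -v dkms >/dev/null 2>&1 && dkms status || echo "dkms not installed"
```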

Thanks