GPU stuck during deep learning training

Environment:

OS: Ubuntu 16.04.1, kernel: 4.15.0-70-generic

Hardware: GTX 1080 Ti

CUDA: 10.1

Problem:

The GPU hangs during deep learning training, and the nvidia-smi command also hangs; it seems the NVIDIA driver crashed. This has happened twice in the last two weeks. The last time, I rebooted the machine and reinstalled the NVIDIA driver, which made it work normally again, but after only a few days the same problem occurred. The kernel log also reports a BUG and an Oops.

Attached are the logs from the last two occurrences.

any suggestions?
thanks!
nvidia-bug-report.log (459 KB)
log.zip (158 KB)

Hi,

Could you please share a sample repro script so we can help better?
Meanwhile, can you try updating to the latest CUDA/cuDNN version?

Thanks
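
When filing or updating a report like this, it helps to include the exact kernel, CUDA, cuDNN, and driver versions. A minimal sketch for collecting them is below; the cuDNN header path assumes a default install under /usr/local/cuda, so adjust it if your layout differs.

```shell
# Sketch: collect the version info that is useful in a driver bug report.
uname -r                                    # running kernel
command -v nvcc >/dev/null 2>&1 && nvcc --version | grep release \
    || echo "nvcc not on PATH"
# cuDNN 7.x exposes its version as defines in cudnn.h (assumed default location)
[ -f /usr/local/cuda/include/cudnn.h ] \
    && grep -m 3 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h \
    || echo "cudnn.h not found"
command -v nvidia-smi >/dev/null 2>&1 \
    && nvidia-smi --query-gpu=driver_version --format=csv,noheader \
    || echo "nvidia-smi not available"
```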

I was training the model as usual when the GPU hung and all GPU threads died.

xxxxx@WX-58:~/chenb$ sudo fuser -v /dev/nvidia*
[sudo] password for xxxxx:
                     USER        PID ACCESS COMMAND
/dev/nvidiactl:      xxxxx     31329 F.... python3
                     xxxxx     31421 F.... nvidia-smi
                     xxxxx     31556 F.... nvidia-smi
                     xxxxx     32353 F.... nvidia-smi
                     xxxxx     34822 F.... conda
                     xxxxx     34979 F.... conda
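
Since nvidia-smi itself hangs once the driver has crashed, a small watchdog can detect the hang instead of blocking forever. A sketch using the coreutils `timeout` command (the 10-second limit is an arbitrary choice):

```shell
# Sketch: probe whether the driver still responds, without blocking forever.
# A hung nvidia-smi is the symptom reported above; `timeout` kills it after 10s.
if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not installed"
elif timeout 10 nvidia-smi >/dev/null 2>&1; then
    echo "driver responsive"
else
    echo "nvidia-smi hung or failed -- check dmesg for an Oops"
fi
```

Running this periodically (e.g. from cron) would record roughly when the driver died, which can then be correlated with the kernel log timestamps.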

The attached logs were captured with sudo nvidia-bug-report.sh; the kernel message log contains a kernel BUG, as follows:

[437607.051444] BUG: unable to handle kernel paging request at 0000000000001064
[437607.051824] IP: _nv027168rm+0x285/0x430 [nvidia]
[437607.051826] PGD 0 P4D 0 
[437607.051830] Oops: 0000 [#1] SMP PTI
[437607.051833] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) ipmi_devintf vfio_iommu_type1 vfio xt_nat xt_tcpudp veth xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c ip_tables x_tables br_netfilter bridge stp llc aufs overlay snd_hda_codec_hdmi ipmi_ssif snd_hda_intel snd_hda_codec intel_rapl sb_edac snd_hda_core x86_pkg_temp_thermal snd_hwdep intel_powerclamp coretemp kvm_intel snd_pcm kvm snd_seq_midi snd_seq_midi_event irqbypass snd_rawmidi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_seq snd_seq_device snd_timer pcbc aesni_intel snd aes_x86_64 crypto_simd
[437607.051887]  glue_helper soundcore dcdbas cryptd intel_cstate intel_rapl_perf ipmi_si mei_me mei lpc_ich ipmi_msghandler shpchp acpi_power_meter mac_hid parport_pc ppdev lp parport autofs4 mxm_wmi mgag200 ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb dca drm i2c_algo_bit ptp pps_core ahci libahci megaraid_sas wmi [last unloaded: ipmi_devintf]
[437607.051919] CPU: 38 PID: 30021 Comm: python3 Tainted: P           OE    4.15.0-70-generic #79~16.04.1-Ubuntu
[437607.051921] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.4.2 01/09/2017
[437607.052243] RIP: 0010:_nv027168rm+0x285/0x430 [nvidia]
[437607.052245] RSP: 0018:ffffa7250813f958 EFLAGS: 00010292
[437607.052247] RAX: 0000000000000000 RBX: ffff96e47e750008 RCX: 0000000000000020
[437607.052248] RDX: 0000000000000001 RSI: ffff96f2a2e65dcc RDI: 0000000000000001
[437607.052250] RBP: ffff96f2a2e65dd8 R08: 0000000000000020 R09: ffff96f2a2e65dc0
[437607.052251] R10: ffffffffc0e31830 R11: 0000000000000000 R12: 0000000000000001
[437607.052253] R13: ffff96f3162b8008 R14: 0000000000000000 R15: 000000005c000001
[437607.052255] FS:  00007f2c01a67740(0000) GS:ffff96e33fcc0000(0000) knlGS:0000000000000000
[437607.052256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[437607.052258] CR2: 0000000000001064 CR3: 000000100120a003 CR4: 00000000003606e0
[437607.052260] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[437607.052261] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[437607.052262] Call Trace:
[437607.052482]  ? _nv027256rm+0x1a3/0x840 [nvidia]
[437607.052696]  ? _nv003594rm+0x9/0x20 [nvidia]
[437607.052911]  ? _nv004345rm+0x1b/0x80 [nvidia]
[437607.053127]  ? _nv011188rm+0x2d7/0x350 [nvidia]
[437607.053343]  ? _nv035367rm+0x89/0x120 [nvidia]
[437607.053558]  ? _nv035366rm+0x250/0x500 [nvidia]
[437607.053774]  ? _nv035363rm+0x56/0x70 [nvidia]
[437607.053989]  ? _nv035364rm+0xad/0xd0 [nvidia]
[437607.054211]  ? _nv034155rm+0xcd/0x160 [nvidia]
[437607.054413]  ? _nv000883rm+0x67/0xa0 [nvidia]
[437607.054614]  ? rm_free_unused_clients+0xcb/0xe0 [nvidia]
[437607.054732]  ? nvidia_close+0x15a/0x2e0 [nvidia]
[437607.054849]  ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[437607.054855]  ? __fput+0xea/0x220
[437607.054859]  ? ____fput+0xe/0x10
[437607.054864]  ? task_work_run+0x8a/0xb0
[437607.054868]  ? do_exit+0x2e9/0xbd0
[437607.054872]  ? do_group_exit+0x43/0xb0
[437607.054875]  ? get_signal+0x169/0x820
[437607.054881]  ? do_signal+0x37/0x730
[437607.054886]  ? do_futex+0x129/0x590
[437607.054893]  ? exit_to_usermode_loop+0x80/0xd0
[437607.054896]  ? do_syscall_64+0x100/0x130
[437607.054903]  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[437607.054905] Code: 11 01 00 00 c7 45 08 00 00 00 00 41 8b bd 74 10 00 00 83 ff 20 0f 84 61 01 00 00 e8 96 e1 bb ff 4c 89 ef 41 89 c6 e8 3b 07 bc ff <8b> 88 64 10 00 00 44 89 e0 4c 89 ef d3 e0 f7 d0 41 21 c6 e8 13 
[437607.055258] RIP: _nv027168rm+0x285/0x430 [nvidia] RSP: ffffa7250813f958
[437607.055260] CR2: 0000000000001064
[437607.055263] ---[ end trace 555ade73743b88c2 ]---
[437607.119214] Fixing recursive fault but reboot is needed!

The latest CUDA version is 10.1, and that is already the version I am running. Do you think this is a kernel bug or a CUDA bug?

This does not happen in the middle of training, but right when training starts; it is not a problem during use, it is a problem at startup.
Thanks

Hi,

Can you try upgrading to the latest CUDA/cuDNN version?
Also, run the following commands to get the latest kernel:
sudo apt-get update
sudo apt-get dist-upgrade

And then reboot the machine.
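
After the dist-upgrade and reboot, something like the following can confirm that the kernel actually changed and that the NVIDIA module was rebuilt for it. This sketch assumes the driver was installed through DKMS-based Ubuntu packages; if it was installed from the .run installer, the dkms check will not apply.

```shell
# Sketch: confirm versions after `apt-get dist-upgrade` and a reboot.
uname -r                          # should now show the newer kernel
command -v nvidia-smi >/dev/null 2>&1 \
    && nvidia-smi | head -n 3 \
    || echo "nvidia-smi not available (driver may need reinstalling)"
# DKMS status shows whether the NVIDIA module was rebuilt for the new kernel
command -v dkms >/dev/null 2>&1 && dkms status || echo "dkms not installed"
```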

Thanks