nvidia-smi and nvidia-persistenced hang due to an NVIDIA driver issue on A100

Hi,

We are running AWS p4d.24xlarge hosts with Nvidia A100 GPUs. Each host starts nvidia-persistenced and primes the GPUs before running a workload.

We’re using Nvidia driver version 470.57.02 on Amazon Linux 1 (Linux 4.14.248-129.473.amzn1.x86_64).

For a fraction of our workloads, nvidia-persistenced either fails with the error "nvidia-persistenced failed to initialize. Check syslog for more details." or hangs outright. When it hangs, nvidia-smi hangs in the same way, and nvidia-bug-report.sh also hangs before generating any meaningful log (even with --safe-mode). In /var/log/messages, we observe the following stack trace:

Nov 19 20:27:48 localhost kernel: [  215.960022] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
Nov 19 20:27:48 localhost kernel: [  215.960268] IP: _nv035089rm+0xac/0x130 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.960269] PGD 8000011b2ec78067 P4D 8000011b2ec78067 PUD 11ca3ced067 PMD 0
Nov 19 20:27:48 localhost kernel: [  215.960271] Oops: 0002 [#1] SMP PTI
Nov 19 20:27:48 localhost kernel: [  215.960273] Modules linked in: efa(O) xt_nat xt_tcpudp veth xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc nvidia_uvm(PO) iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6table_filter ip6_tables x_tables binfmt_misc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) mousedev evdev psmouse drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb fbdev ib_uverbs ib_core drm ipv6 i2c_core crc_ccitt ena pcc_cpufreq button ext4 crc16 mbcache jbd2 fscrypto nvme nvme_core dm_mirror dm_region_hash dm_log
Nov 19 20:27:48 localhost kernel: [  215.960293]  dm_mod dax [last unloaded: efa]
Nov 19 20:27:48 localhost kernel: [  215.960296] CPU: 79 PID: 46835 Comm: nvidia-persiste Tainted: P           O    4.14.248-129.473.amzn1.x86_64 #1
Nov 19 20:27:48 localhost kernel: [  215.960297] Hardware name: Amazon EC2 p4d.24xlarge/, BIOS 1.0 10/16/2017
Nov 19 20:27:48 localhost kernel: [  215.960298] task: ffff899cb9394000 task.stack: ffffc9001a1ac000
Nov 19 20:27:48 localhost kernel: [  215.960500] RIP: 0010:_nv035089rm+0xac/0x130 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.960501] RSP: 0018:ffffc9001a1af8d8 EFLAGS: 00010206
Nov 19 20:27:48 localhost kernel: [  215.960502] RAX: ffff899aeafbde78 RBX: 0000000000000065 RCX: 0000000000000080
Nov 19 20:27:48 localhost kernel: [  215.960502] RDX: 00000000e9875028 RSI: ffff899b3a960008 RDI: 0000000000000014
Nov 19 20:27:48 localhost kernel: [  215.960502] RBP: ffff899aeafbde28 R08: 0000000000000001 R09: 0000000000000000
Nov 19 20:27:48 localhost kernel: [  215.960503] R10: 000000000000b6f3 R11: 0000000000000365 R12: 0000000000000065
Nov 19 20:27:48 localhost kernel: [  215.960503] R13: ffff899b3a960008 R14: ffff899aeaebc008 R15: ffff899b3b4e0008
Nov 19 20:27:48 localhost kernel: [  215.960504] FS:  00007f021e0c2740(0000) GS:ffff899cbd3c0000(0000) knlGS:0000000000000000
Nov 19 20:27:48 localhost kernel: [  215.960504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 19 20:27:48 localhost kernel: [  215.960505] CR2: 0000000000000088 CR3: 0000011cba1ae001 CR4: 00000000007606e0
Nov 19 20:27:48 localhost kernel: [  215.960507] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 19 20:27:48 localhost kernel: [  215.960508] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 19 20:27:48 localhost kernel: [  215.960508] PKRU: 55555554
Nov 19 20:27:48 localhost kernel: [  215.960508] Call Trace:
Nov 19 20:27:48 localhost kernel: [  215.960652]  ? _nv035084rm+0x171/0x3a0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.960772]  ? _nv039441rm+0x82/0xc0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.960891]  ? _nv039441rm+0x4a/0xc0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961038]  ? _nv011611rm+0x49/0x80 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961195]  ? _nv035462rm+0x119/0x210 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961324]  ? _nv011580rm+0x75/0x230 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961481]  ? _nv035462rm+0x119/0x210 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961614]  ? _nv039433rm+0x68/0x170 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961744]  ? _nv039406rm+0x13d/0x350 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.961876]  ? _nv019401rm+0x10/0x10 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962031]  ? _nv023351rm+0x87/0x210 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962035]  ? down+0x12/0x50
Nov 19 20:27:48 localhost kernel: [  215.962153]  ? _nv000763rm+0x23f/0x350 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962268]  ? _nv000714rm+0x6f3/0x22e0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962372]  ? rm_init_adapter+0xc5/0xe0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962435]  ? nv_open_device+0x3d0/0x860 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962499]  ? nvidia_open+0x2f7/0x4d0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962502]  ? kobj_lookup+0x113/0x160
Nov 19 20:27:48 localhost kernel: [  215.962567]  ? nvidia_frontend_open+0x53/0x90 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.962570]  ? chrdev_open+0xb4/0x1a0
Nov 19 20:27:48 localhost kernel: [  215.962571]  ? cdev_put.part.1+0x20/0x20
Nov 19 20:27:48 localhost kernel: [  215.962573]  ? do_dentry_open+0x1ef/0x320
Nov 19 20:27:48 localhost kernel: [  215.962574]  ? __inode_permission+0x85/0xc0
Nov 19 20:27:48 localhost kernel: [  215.962575]  ? path_openat+0x677/0x16b0
Nov 19 20:27:48 localhost kernel: [  215.962576]  ? legitimize_path.isra.33+0x28/0x50
Nov 19 20:27:48 localhost kernel: [  215.962577]  ? unlazy_walk+0x32/0xa0
Nov 19 20:27:48 localhost kernel: [  215.962578]  ? terminate_walk+0x8c/0x100
Nov 19 20:27:48 localhost kernel: [  215.962579]  ? do_filp_open+0x8c/0xf0
Nov 19 20:27:48 localhost kernel: [  215.962580]  ? chown_common.isra.15+0xe7/0x170
Nov 19 20:27:48 localhost kernel: [  215.962580]  ? chown_common.isra.15+0xe7/0x170
Nov 19 20:27:48 localhost kernel: [  215.962583]  ? __alloc_fd+0x44/0x170
Nov 19 20:27:48 localhost kernel: [  215.962584]  ? do_sys_open+0x1a6/0x230
Nov 19 20:27:48 localhost kernel: [  215.962585]  ? do_sys_open+0x1a6/0x230
Nov 19 20:27:48 localhost kernel: [  215.962587]  ? do_syscall_64+0x67/0x110
Nov 19 20:27:48 localhost kernel: [  215.962592]  ? entry_SYSCALL_64_after_hwframe+0x41/0xa6
Nov 19 20:27:48 localhost kernel: [  215.962593] Code: 08 44 89 e0 5b 41 5c c3 0f 1f 80 00 00 00 00 48 c1 e1 06 48 03 8c fe b8 23 00 00 45 84 c0 8b 50 08 44 8b 48 0c 74 78 85 db 74 44 <48> 83 41 08 01 0f b6 50 06 83 e2 03 80 fa 03 75 be 45 84 c0 75
Nov 19 20:27:48 localhost kernel: [  215.962730] RIP: _nv035089rm+0xac/0x130 [nvidia] RSP: ffffc9001a1af8d8
Nov 19 20:27:48 localhost kernel: [  215.962731] CR2: 0000000000000088
Nov 19 20:27:48 localhost kernel: [  215.962733] ---[ end trace 8aaf2ae4cd161133 ]---
Nov 19 20:27:48 localhost kernel: [  215.963903] BUG: unable to handle kernel NULL pointer dereference at 000000000000002d
Nov 19 20:27:48 localhost kernel: [  215.964035] IP: _nv010150rm+0x3c/0x340 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964037] PGD 0 P4D 0
Nov 19 20:27:48 localhost kernel: [  215.964039] Oops: 0000 [#2] SMP PTI
Nov 19 20:27:48 localhost kernel: [  215.964040] Modules linked in: efa(O) xt_nat xt_tcpudp veth xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc nvidia_uvm(PO) iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 nf_nat nf_conntrack ip6table_filter ip6_tables x_tables binfmt_misc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) mousedev evdev psmouse drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea fb fbdev ib_uverbs ib_core drm ipv6 i2c_core crc_ccitt ena pcc_cpufreq button ext4 crc16 mbcache jbd2 fscrypto nvme nvme_core dm_mirror dm_region_hash dm_log
Nov 19 20:27:48 localhost kernel: [  215.964076]  dm_mod dax [last unloaded: efa]
Nov 19 20:27:48 localhost kernel: [  215.964078] CPU: 31 PID: 46835 Comm: nvidia-persiste Tainted: P      D    O    4.14.248-129.473.amzn1.x86_64 #1
Nov 19 20:27:48 localhost kernel: [  215.964079] Hardware name: Amazon EC2 p4d.24xlarge/, BIOS 1.0 10/16/2017
Nov 19 20:27:48 localhost kernel: [  215.964079] task: ffff899cb9394000 task.stack: ffffc9001a1ac000
Nov 19 20:27:48 localhost kernel: [  215.964250] RIP: 0010:_nv010150rm+0x3c/0x340 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964252] RSP: 0018:ffffc9001a1afcd8 EFLAGS: 00010006
Nov 19 20:27:48 localhost kernel: [  215.964254] RAX: 000000000000002d RBX: ffffc9001a1afd08 RCX: 000000000000002d
Nov 19 20:27:48 localhost kernel: [  215.964255] RDX: ffffc9001a1afd58 RSI: 000000000000b6f3 RDI: ffffffffa26f9638
Nov 19 20:27:48 localhost kernel: [  215.964256] RBP: ffff899aead0b000 R08: ffff899ae9bdfe40 R09: ffffc9001a1afd08
Nov 19 20:27:48 localhost kernel: [  215.964257] R10: ffffc9001a1afcc0 R11: 0000000000000000 R12: ffffffffa10dd608
Nov 19 20:27:48 localhost kernel: [  215.964258] R13: ffff899ca2ec4800 R14: ffff899aead08000 R15: ffff899ca2ec4d28
Nov 19 20:27:48 localhost kernel: [  215.964259] FS:  00007f021e0c2740(0000) GS:ffff899cbcdc0000(0000) knlGS:0000000000000000
Nov 19 20:27:48 localhost kernel: [  215.964260] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 19 20:27:48 localhost kernel: [  215.964261] CR2: 000000000000002d CR3: 0000000001e0a005 CR4: 00000000007606e0
Nov 19 20:27:48 localhost kernel: [  215.964264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 19 20:27:48 localhost kernel: [  215.964265] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 19 20:27:48 localhost kernel: [  215.964266] PKRU: 55555554
Nov 19 20:27:48 localhost kernel: [  215.964266] Call Trace:
Nov 19 20:27:48 localhost kernel: [  215.964367]  ? _nv039714rm+0xb0/0x1a0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964493]  ? rm_shutdown_adapter+0x28/0xa0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964590]  ? nv_close_device+0x16e/0x1b0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964687]  ? nvidia_close_callback+0xa5/0x1c0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964785]  ? nvidia_close+0xe4/0x2d0 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964870]  ? nvidia_frontend_close+0x2a/0x40 [nvidia]
Nov 19 20:27:48 localhost kernel: [  215.964873]  ? __fput+0xca/0x1d0
Nov 19 20:27:48 localhost kernel: [  215.964878]  ? task_work_run+0x8a/0xb0
Nov 19 20:27:48 localhost kernel: [  215.964881]  ? do_exit+0x380/0xb80
Nov 19 20:27:48 localhost kernel: [  215.964884]  ? do_sys_open+0x1a6/0x230
Nov 19 20:27:48 localhost kernel: [  215.964888]  ? rewind_stack_do_exit+0x17/0x20
Nov 19 20:27:48 localhost kernel: [  215.964889] Code: eb 07 0f 1f 44 00 00 31 d2 48 8b 07 48 85 c0 75 1a e9 a1 02 00 00 66 0f 1f 84 00 00 00 00 00 48 8b 48 10 48 85 c9 74 17 48 89 c8 <48> 39 30 77 ef 0f 83 29 02 00 00 48 8b 48 18 48 85 c9 75 e9 48
Nov 19 20:27:48 localhost kernel: [  215.965073] RIP: _nv010150rm+0x3c/0x340 [nvidia] RSP: ffffc9001a1afcd8
Nov 19 20:27:48 localhost kernel: [  215.965074] CR2: 000000000000002d
Nov 19 20:27:48 localhost kernel: [  215.965075] ---[ end trace 8aaf2ae4cd161134 ]---
Nov 19 20:27:48 localhost kernel: [  215.965076] Fixing recursive fault but reboot is needed!
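
In case it helps with triage, here is a minimal sketch of how the oops records in a log like the one above could be pulled out to flag affected hosts. The function name find_oopses and the regular expressions are our own assumptions, written against the 4.14-style lines shown here; they are not part of any NVIDIA tool, and other kernel versions may format these lines differently.

```python
import re

# Patterns written against the 4.14-style oops lines in this post (assumption:
# other kernels may word or lay out these lines differently).
BUG_RE = re.compile(
    r"BUG: unable to handle kernel NULL pointer dereference at ([0-9a-f]+)"
)
# \b keeps this from matching the later "RIP:" lines.
IP_RE = re.compile(r"\bIP: (\S+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")
COMM_RE = re.compile(r"PID: (\d+) Comm: (\S+)")


def find_oopses(lines):
    """Return one dict per NULL-dereference oops found in kernel log lines."""
    oopses = []
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            # A new oops record starts at each "BUG:" line.
            oopses.append({"address": int(m.group(1), 16)})
            continue
        if not oopses:
            continue
        current = oopses[-1]
        m = IP_RE.search(line)
        if m and "symbol" not in current:
            current["symbol"], current["module"] = m.group(1), m.group(2)
        m = COMM_RE.search(line)
        if m and "pid" not in current:
            current["pid"], current["comm"] = int(m.group(1)), m.group(2)
    return oopses
```

Passing an open handle to /var/log/messages (for example, find_oopses(open("/var/log/messages"))) would yield one record per oops, each with the faulting address, the symbol and module from the IP line, and the PID and command name.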

Any help would be appreciated!