Kernel OOPS: NULL pointer dereference when closing CUDA application KataGo

We are running into what looks like an NVIDIA driver bug. We run KataGo, an open-source replica of AlphaZero; sometimes when we terminate it with Ctrl+C, the process hangs and cannot be killed (even with kill -9), and any other application that then tries to use the GPU (e.g. nvidia-smi) also hangs. Concurrently with this, we see a kernel OOPS reported in dmesg with a NULL pointer dereference in the NVIDIA driver. For example:

Apr  4 20:28:54 ppo kernel: [1988517.611088] BUG: kernel NULL pointer dereference, address: 00000000000001a0
Apr  4 20:28:54 ppo kernel: [1988517.611147] #PF: supervisor read access in kernel mode
Apr  4 20:28:54 ppo kernel: [1988517.611169] #PF: error_code(0x0000) - not-present page
Apr  4 20:28:54 ppo kernel: [1988517.611188] PGD 161fd0f9067 P4D 161fd0f9067 PUD 161fd41d067 PMD 0
Apr  4 20:28:54 ppo kernel: [1988517.611213] Oops: 0000 [#1] SMP NOPTI
Apr  4 20:28:54 ppo kernel: [1988517.611229] CPU: 200 PID: 2723474 Comm: katago Tainted: P           OE     5.4.0-104-generic #118-Ubuntu
Apr  4 20:28:54 ppo kernel: [1988517.611259] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS 2.3 10/21/2021
Apr  4 20:28:54 ppo kernel: [1988517.611686] RIP: 0010:_nv028129rm+0x4c1/0x5b0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.611710] Code: 85 33 01 00 84 c0 75 0a 41 80 bd 03 05 00 00 00 74 c7 49 8b 85 c0 1a 00 00 4c 89 ef 41 b8 01 00 00 00 b9 01 00 00 00 48 89 da <48> 8b 80 a0 01 00 00 48 8b b0 a0 01 00 00 e8 cc ca f2 ff 85 c0 41
Apr  4 20:28:54 ppo kernel: [1988517.611761] RSP: 0018:ffff9b063583b960 EFLAGS: 00010246
Apr  4 20:28:54 ppo kernel: [1988517.611779] RAX: 0000000000000000 RBX: ffff8ea2c9f3e008 RCX: 0000000000000001
Apr  4 20:28:54 ppo kernel: [1988517.611801] RDX: ffff8ea2c9f3e008 RSI: 0000000000000000 RDI: ffff8ea3b5838008
Apr  4 20:28:54 ppo kernel: [1988517.611823] RBP: ffff8f84a549ad10 R08: 0000000000000001 R09: ffffffffc05c4a00
Apr  4 20:28:54 ppo kernel: [1988517.611844] R10: ffff8ea2c9f38000 R11: 0000000000000001 R12: ffff8ea3b5838008
Apr  4 20:28:54 ppo kernel: [1988517.611865] R13: ffff8ea3b5838008 R14: 0000000000000000 R15: 0000000000000000
Apr  4 20:28:54 ppo kernel: [1988517.611887] FS:  00007fe4b2fea000(0000) GS:ffff8fa43ea00000(0000) knlGS:0000000000000000
Apr  4 20:28:54 ppo kernel: [1988517.611910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  4 20:28:54 ppo kernel: [1988517.611928] CR2: 00000000000001a0 CR3: 00000177c8224003 CR4: 0000000000760ee0
Apr  4 20:28:54 ppo kernel: [1988517.611950] PKRU: 55555554
Apr  4 20:28:54 ppo kernel: [1988517.611959] Call Trace:
Apr  4 20:28:54 ppo kernel: [1988517.612069]  ? _nv028102rm+0x62/0x110 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.612186]  ? _nv002233rm+0x9/0x20 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.612296]  ? _nv003684rm+0x1b/0x70 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.612407]  ? _nv013896rm+0x784/0x7f0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.612487]  ? _nv034512rm+0xac/0xe0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.612601]  ? _nv035938rm+0xb0/0x140 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.613238]  ? _nv035937rm+0x30f/0x4f0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.613728]  ? _nv035932rm+0x60/0x70 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.614202]  ? _nv035933rm+0x7b/0xb0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.614636]  ? _nv034420rm+0x40/0xe0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.615084]  ? _nv000627rm+0x68/0x80 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.615529]  ? rm_cleanup_file_private+0xea/0x170 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.615942]  ? nvidia_close+0x149/0x2d0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.616354]  ? nvidia_frontend_close+0x2f/0x50 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.616720]  ? __fput+0xcc/0x260
Apr  4 20:28:54 ppo kernel: [1988517.617077]  ? ____fput+0xe/0x10
Apr  4 20:28:54 ppo kernel: [1988517.617428]  ? task_work_run+0x8f/0xb0
Apr  4 20:28:54 ppo kernel: [1988517.617767]  ? do_exit+0x36e/0xaf0
Apr  4 20:28:54 ppo kernel: [1988517.618092]  ? do_group_exit+0x47/0xb0
Apr  4 20:28:54 ppo kernel: [1988517.618407]  ? get_signal+0x169/0x890
Apr  4 20:28:54 ppo kernel: [1988517.618712]  ? do_signal+0x34/0x6c0
Apr  4 20:28:54 ppo kernel: [1988517.619006]  ? __x64_sys_futex+0x13f/0x170
Apr  4 20:28:54 ppo kernel: [1988517.619299]  ? exit_to_usermode_loop+0xbf/0x160
Apr  4 20:28:54 ppo kernel: [1988517.619582]  ? do_syscall_64+0x163/0x190
Apr  4 20:28:54 ppo kernel: [1988517.619876]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr  4 20:28:54 ppo kernel: [1988517.620169] Modules linked in: btrfs zstd_compress ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid xt_nat veth nf_conntrack_netlink xt_MASQUERADE bridge stp llc nfnetlink xfrm_user iptable_nat nf_nat xt_owner rpcsec_gss_krb5 auth_rpcgss nfsv4 xt_recent aufs nfsv3 nfs_acl nfs lockd grace fscache overlay bonding binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel wmi_bmof snd_hda_codec_hdmi ipmi_ssif snd_hda_intel snd_seq_midi snd_intel_dspcfg snd_seq_midi_event snd_hda_codec snd_rawmidi snd_hda_core snd_hwdep snd_seq igb snd_pcm ahci ixgbe libahci xfrm_algo mdio dca snd_seq_device ccp snd_timer snd soundcore i2c_piix4 wmi ipmi_si mac_hid sch_fq_codel nvidia_uvm(OE) nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG ipmi_devintf xt_multiport xt_comment ipmi_msghandler xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack
Apr  4 20:28:54 ppo kernel: [1988517.620231]  nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter msr ip6_tables parport_pc iptable_filter ppdev bpfilter lp parport sunrpc ip_tables x_tables autofs4 raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear nvidia_drm(POE) nvidia_modeset(POE) aesni_intel glue_helper crypto_simd cryptd nvidia(POE) ast drm_vram_helper i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt nvme fb_sys_fops drm nvme_core
Apr  4 20:28:54 ppo kernel: [1988517.624483] CR2: 00000000000001a0
Apr  4 20:28:54 ppo kernel: [1988517.624860] ---[ end trace 81985706cda6be60 ]---
Apr  4 20:28:54 ppo kernel: [1988517.775713] RIP: 0010:_nv028129rm+0x4c1/0x5b0 [nvidia]
Apr  4 20:28:54 ppo kernel: [1988517.776147] Code: 85 33 01 00 84 c0 75 0a 41 80 bd 03 05 00 00 00 74 c7 49 8b 85 c0 1a 00 00 4c 89 ef 41 b8 01 00 00 00 b9 01 00 00 00 48 89 da <48> 8b 80 a0 01 00 00 48 8b b0 a0 01 00 00 e8 cc ca f2 ff 85 c0 41
Apr  4 20:28:54 ppo kernel: [1988517.777045] RSP: 0018:ffff9b063583b960 EFLAGS: 00010246
Apr  4 20:28:54 ppo kernel: [1988517.777497] RAX: 0000000000000000 RBX: ffff8ea2c9f3e008 RCX: 0000000000000001
Apr  4 20:28:54 ppo kernel: [1988517.777951] RDX: ffff8ea2c9f3e008 RSI: 0000000000000000 RDI: ffff8ea3b5838008
Apr  4 20:28:54 ppo kernel: [1988517.778404] RBP: ffff8f84a549ad10 R08: 0000000000000001 R09: ffffffffc05c4a00
Apr  4 20:28:54 ppo kernel: [1988517.778853] R10: ffff8ea2c9f38000 R11: 0000000000000001 R12: ffff8ea3b5838008
Apr  4 20:28:54 ppo kernel: [1988517.779305] R13: ffff8ea3b5838008 R14: 0000000000000000 R15: 0000000000000000
Apr  4 20:28:54 ppo kernel: [1988517.779756] FS:  00007fe4b2fea000(0000) GS:ffff8fa43ea00000(0000) knlGS:0000000000000000
Apr  4 20:28:54 ppo kernel: [1988517.780215] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  4 20:28:54 ppo kernel: [1988517.780678] CR2: 00000000000001a0 CR3: 00000177c8224003 CR4: 0000000000760ee0
Apr  4 20:28:54 ppo kernel: [1988517.781147] PKRU: 55555554
Apr  4 20:28:54 ppo kernel: [1988517.781618] Fixing recursive fault but reboot is needed!

We have replicated this problem on RTX A6000 and RTX A4000 GPUs, and on drivers 510.54 and 470.103. We are running Ubuntu 20.04, kernel 5.4.0-107-generic #121-Ubuntu SMP, with drivers installed from the Lambda Stack. We run our application in Docker with nvidia-container-toolkit. I’ve attached an nvidia-bug-report.log.gz from one of our machines running 510.54, as well as a complete dmesg from one of the OOPSes.

nvidia-bug-report.log.gz (4.7 MB)
oops.log (321.7 KB)
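
In case it helps others confirm they are hitting the same thing, here is a minimal sketch of the checks we use to tell a machine is in this state (assuming standard dmesg, procps and coreutils; the grep pattern just matches the OOPS above):

# look for the NULL pointer dereference from the nvidia module
dmesg | grep -B1 -A3 "BUG: kernel NULL pointer dereference"

# a process that can't be killed even by kill -9 is typically stuck in uninterruptible sleep (state D)
ps -eo pid,stat,comm | awk '$2 ~ /D/'

# nvidia-smi also hangs once the bug has triggered, so run it under a timeout
timeout 10 nvidia-smi || echo "nvidia-smi did not return within 10s"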


We have replicated this issue on a system with a GTX 1080 Ti and an ASUS motherboard, so it does not seem specific to the platform in any way. Driver is 510.54, kernel is 5.4.0-107-generic #121-Ubuntu SMP.

[1543093.795916] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c4000-0x000c7fff window]
[1543093.796112] caller os_map_kernel_space.part.0+0x82/0xb0 [nvidia] mapping multiple BARs
[1543094.357255] BUG: kernel NULL pointer dereference, address: 00000000000001a0
[1543094.422968] #PF: supervisor read access in kernel mode
[1543094.474864] #PF: error_code(0x0000) - not-present page
[1543094.519471] PGD 0 P4D 0 
[1543094.561618] Oops: 0000 [#1] SMP PTI
[1543094.603105] CPU: 32 PID: 775839 Comm: cuda-EvtHandlr Tainted: P           OE     5.4.0-107-generic #121-Ubuntu
[1543094.685493] Hardware name: ASUSTeK COMPUTER INC. WS-C621E-SAGE Series/WS-C621E-SAGE Series, BIOS 3501 09/21/2018
[1543094.768154] RIP: 0010:_nv028129rm+0x4c1/0x5b0 [nvidia]
[1543094.808892] Code: 85 33 01 00 84 c0 75 0a 41 80 bd 03 05 00 00 00 74 c7 49 8b 85 c0 1a 00 00 4c 89 ef 41 b8 01 00 00 00 b9 01 00 00 00 48 89 da <48> 8b 80 a0 01 00 00 48 8b b0 a0 01 00 00 e8 cc ca f2 ff 85 c0 41
[1543094.931828] RSP: 0018:ffffb7c53061f960 EFLAGS: 00010246
[1543094.971990] RAX: 0000000000000000 RBX: ffff94412238c008 RCX: 0000000000000001
[1543095.051007] RDX: ffff94412238c008 RSI: 0000000000000000 RDI: ffff94411e898008
[1543095.130249] RBP: ffff9466da9cad10 R08: 0000000000000001 R09: ffffffffc0950a00
[1543095.209508] R10: ffff944122388000 R11: 0000000000000001 R12: ffff94411e898008
[1543095.288587] R13: ffff94411e898008 R14: 0000000000000000 R15: 0000000000000000
[1543095.367992] FS:  00007f46b6d7d700(0000) GS:ffff945721c00000(0000) knlGS:0000000000000000
[1543095.448248] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1543095.488486] CR2: 00000000000001a0 CR3: 0000001d4500a003 CR4: 00000000007606e0
[1543095.566988] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1543095.645867] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1543095.726796] PKRU: 55555554
[1543095.767403] Call Trace:
[1543095.807548]  ? _nv028102rm+0x62/0x110 [nvidia]
[1543095.847232]  ? _nv002233rm+0x9/0x20 [nvidia]
[1543095.886286]  ? _nv003684rm+0x1b/0x70 [nvidia]
[1543095.924334]  ? _nv013896rm+0x784/0x7f0 [nvidia]
[1543095.961739]  ? _nv034512rm+0xac/0xe0 [nvidia]
[1543095.998426]  ? _nv035938rm+0xb0/0x140 [nvidia]
[1543096.034277]  ? _nv035937rm+0x30f/0x4f0 [nvidia]
[1543096.069500]  ? _nv035932rm+0x60/0x70 [nvidia]
[1543096.103972]  ? _nv035933rm+0x7b/0xb0 [nvidia]
[1543096.137529]  ? _nv034420rm+0x40/0xe0 [nvidia]
[1543096.169706]  ? _nv000627rm+0x68/0x80 [nvidia]
[1543096.200747]  ? rm_cleanup_file_private+0xea/0x170 [nvidia]
[1543096.231298]  ? nvidia_close+0x149/0x2d0 [nvidia]
[1543096.261082]  ? nvidia_frontend_close+0x2f/0x50 [nvidia]
[1543096.290347]  ? __fput+0xcc/0x260
[1543096.318479]  ? ____fput+0xe/0x10
[1543096.345530]  ? task_work_run+0x8f/0xb0
[1543096.371817]  ? do_exit+0x36e/0xaf0
[1543096.397014]  ? poll_select_finish+0x210/0x210
[1543096.421541]  ? do_group_exit+0x47/0xb0
[1543096.445143]  ? get_signal+0x169/0x890
[1543096.468095]  ? poll_select_finish+0x210/0x210
[1543096.490859]  ? do_signal+0x34/0x6c0
[1543096.512617]  ? poll_select_finish+0x210/0x210
[1543096.533722]  ? poll_select_finish+0x210/0x210
[1543096.553483]  ? exit_to_usermode_loop+0xbf/0x160
[1543096.572395]  ? do_syscall_64+0x163/0x190
[1543096.590521]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1543096.609195] Modules linked in: cpuid veth nf_conntrack_netlink xt_nat xt_MASQUERADE bridge stp llc nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat aufs overlay binfmt_misc intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp nvidia_uvm(OE) nvidia_drm(POE) coretemp nvidia_modeset(POE) ipmi_ssif kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio rapl joydev nvidia(POE) intel_cstate uas usb_storage snd_hda_intel snd_intel_dspcfg ast snd_hda_codec drm_vram_helper snd_hda_core ttm snd_hwdep drm_kms_helper snd_pcm fb_sys_fops syscopyarea snd_timer sysfillrect snd sysimgblt igb soundcore i2c_i801 i2c_algo_bit lpc_ich mei_me ahci ioatdma mei libahci dca mxm_wmi ipmi_si ipmi_devintf ipmi_msghandler wmi ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt xt_comment ipt_REJECT nf_reject_ipv4 xt_owner xt_limit xt_addrtype xt_tcpudp
[1543096.609228]  xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter sch_fq_codel ip6_tables iptable_filter bpfilter drm ip_tables x_tables autofs4 input_leds hid_generic usbhid hid raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 multipath linear nvme nvme_core mac_hid
[1543097.125165] CR2: 00000000000001a0
[1543097.161601] ---[ end trace 27b52232fd9f90fb ]---
[1543097.274174] RIP: 0010:_nv028129rm+0x4c1/0x5b0 [nvidia]
[1543097.311169] Code: 85 33 01 00 84 c0 75 0a 41 80 bd 03 05 00 00 00 74 c7 49 8b 85 c0 1a 00 00 4c 89 ef 41 b8 01 00 00 00 b9 01 00 00 00 48 89 da <48> 8b 80 a0 01 00 00 48 8b b0 a0 01 00 00 e8 cc ca f2 ff 85 c0 41
[1543097.427371] RSP: 0018:ffffb7c53061f960 EFLAGS: 00010246
[1543097.466838] RAX: 0000000000000000 RBX: ffff94412238c008 RCX: 0000000000000001
[1543097.547999] RDX: ffff94412238c008 RSI: 0000000000000000 RDI: ffff94411e898008
[1543097.631981] RBP: ffff9466da9cad10 R08: 0000000000000001 R09: ffffffffc0950a00
[1543097.716700] R10: ffff944122388000 R11: 0000000000000001 R12: ffff94411e898008
[1543097.800802] R13: ffff94411e898008 R14: 0000000000000000 R15: 0000000000000000
[1543097.884626] FS:  00007f46b6d7d700(0000) GS:ffff945721c00000(0000) knlGS:0000000000000000
[1543097.968271] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1543098.010220] CR2: 00000000000001a0 CR3: 0000001d4500a003 CR4: 00000000007606e0
[1543098.092164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1543098.173951] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1543098.255165] PKRU: 55555554
[1543098.294595] Fixing recursive fault but reboot is needed!

nvidia-bug-report hangs, and I cannot reboot this server yet as it has a critical task still running, but I am attaching the partial output.
nvidia-bug-report.log.gz (3.6 KB)

@AdamGleaveUCB
Please confirm if you are running the same application on all platforms.
If yes, please help by providing reliable repro steps so that I can try to reproduce the issue locally, which will help with debugging.

@amrits

Yes, same application (KataGo) on all platforms. I’ve made GitHub - HumanCompatibleAI/katago-driver-bug-repro: Docker files to help reproduce bug described in https://forums.developer.nvidia.com/t/kernel-oops-null-pointer-dereference-when-closing-cuda-application-katago/211270/3, which contains a docker-compose file and instructions on how to replicate.

Unfortunately the issue is stochastic: it occurs sometimes, but not always, when we kill a running KataGo instance. It is more likely to occur the longer it’s been running.

I’ll look into whether I can at least automate this (e.g. running and killing it in a loop until it hangs, roughly along the lines of the sketch below), or find some combination of settings that more reliably triggers the issue.
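
Roughly the kind of loop I have in mind (just a sketch, not the actual script in the repo; the compose file path and timings are illustrative):

#!/bin/bash
# Repeatedly start KataGo under docker-compose, let it run for a while, then tear it down.
# If "down" does not return within the timeout, assume the driver has hung and stop.
for i in $(seq 1 100); do
    echo "*** Iteration $i ***"
    docker-compose -f compose/crash.yml up -d
    sleep 45
    if ! timeout 120 docker-compose -f compose/crash.yml down; then
        echo "docker-compose down hung -- likely hit the bug, check dmesg"
        break
    fi
done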

@amrits

We’ve found a more reliable repro case that typically triggers the bug in less than 10 minutes. Tested on 470 and 510.60. It is available on the main branch of GitHub - HumanCompatibleAI/katago-driver-bug-repro: Docker files to help reproduce bug described in https://forums.developer.nvidia.com/t/kernel-oops-null-pointer-dereference-when-closing-cuda-application-katago/211270/3

The bug seems to occur more often when using multiple GPUs, so this test case now assumes a machine with >=7 GPUs.

Do let us know if you have any trouble replicating this.


Just wanted to check in to see whether you’ve been able to reproduce it with the new code. Do let us know if you have any trouble; happy to provide more details.

@AdamGleaveUCB
Thanks for sharing the code; however, I ran it on a notebook/system with a couple of GPUs connected and it failed as below:
root@oemqa-ThinkPad-P1-Gen-3:~/katago-driver-bug-repro# bash loop.sh
docker-compose version 1.25.0, build unknown
*** Iteration 0 ***
Starting Docker compose
Waiting for 45 seconds

Done waiting.
Trying to bring docker service down now.
If this hangs, then bug detected!
ERROR: The Compose file './compose/crash.yml' is invalid because:
services.selfplay.deploy.resources.reservations value Additional properties are not allowed ('devices' was unexpected)
services.selfplay.build contains unsupported option: 'target'
services.selfplay.volumes contains an invalid type, it should be a string
services.selfplay.volumes contains an invalid type, it should be a string
services.selfplay.volumes contains an invalid type, it should be a string
*** Iteration 1 ***
Starting Docker compose
Waiting for 45 seconds
…^C
root@oemqa-ThinkPad-P1-Gen-3:~/katago-driver-bug-repro#

Please confirm whether I need 7 GPUs to run the code successfully.
It would be great and easier to have code which can trigger the issue with a couple of GPUs connected.

Hi @amrits,

I have modified the code to run on fewer GPUs: you can now run bash loop.sh <n> [time], where n is the number of GPUs (2, 3 or 7) and time is the timeout (defaults to 60s, but you may want to increase it if you don’t see GPU utilization occurring, as I’ve found the start-up time varies depending on how powerful the machine is). If running on a system you’ve already run this on, you’ll need to rebuild the Docker image to pick up the new configs, e.g. docker-compose -f compose/crash2.yml --env compose/crash2.env build.

Unfortunately, the issue is much harder to reproduce on fewer GPUs: I ran bash loop.sh 3 for 100 iterations without error. That said, this issue has occurred at least once with 3 GPUs in the wild. So, if you can test it on a larger machine that would be best, but if that is significantly harder then you could just leave bash loop.sh running in the background (you’ll want to increase the number of iterations from 100; see the sketch below) and it should replicate given enough time.
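
For reference, leaving it running unattended could look something like this (a sketch only; the iteration count presumably lives inside loop.sh itself rather than being a command-line argument, so it would need to be edited there):

# run the repro loop detached so it survives the SSH session ending
nohup bash loop.sh 3 > loop-repro.log 2>&1 &
# watch the log (and dmesg) for "If this hangs, then bug detected!" followed by no further output
tail -f loop-repro.log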

I’d be happy to give you temporary access to one of our 8-GPU servers to replicate this, though I imagine you need internal infra to debug this properly.

I left it running for four days with three GPUs and was unfortunately unable to replicate the issue, so the number of GPUs does seem critical for this test case. I will look into whether I can modify the test case to be more reliable with fewer GPUs, but if the bug involves some race condition in the driver, it may be much less likely to occur with a limited number of GPUs.

I tried on a system with 4 x T4 cards but could not reproduce the issue so far. Here is the output captured after running the script shared on GitHub.

If this hangs, then bug detected!
ERROR: .FileNotFoundError: [Errno 2] No such file or directory: './compose/crash4.yml'
*** Iteration 100 ***
Starting Docker compose
Waiting for 60 seconds

Done waiting.
Trying to bring docker service down now.

Hi @amrits,

Thanks for the attempt to replicate. As the output shows, there’s no compose/crash4.yml file, so the test never ran. The script can’t run on an arbitrary number of GPUs, as we have to provide a different config file depending on the number of GPU devices. Sorry, this should have been documented more clearly.

I’ve added a 4-GPU config now. If running on the same machine again you’ll need to rebuild the Docker image to have it include the new config, e.g.:

docker-compose -f compose/crash4.yml --env compose/crash4.env build
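
and then run the loop against the new config, presumably with something like:

bash loop.sh 4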

We did see this error occur once in the wild on a 4-GPU machine, so replication should be possible, but it seems much lower frequency than with more GPUs. So running it on an 8-GPU machine would still be preferable if you have access to one.

To check things are running correctly, you can look in the logs in bug-repro-logs/active/compose.{stdout,stderr}.
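
For example (log paths as above; nvidia-smi is just to confirm the GPUs are actually being exercised while the test runs):

tail -f bug-repro-logs/active/compose.stdout bug-repro-logs/active/compose.stderr
# in another terminal: GPU utilization should be non-zero once KataGo is up
watch -n 5 nvidia-smi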

Best,

Adam