I’m running this simple code:
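import os
# Configure CUDA core dumps before importing torch, so the settings are in place when CUDA initializes.
os.environ['CUDA_COREDUMP_SHOW_PROGRESS'] = '1'
os.environ['CUDA_COREDUMP_FILE'] = '/tmp/cuda_coredump_%h.%p.%t'
os.environ['CUDA_COREDUMP_GENERATION_FLAGS'] = 'skip_global_memory,skip_shared_memory,skip_local_memory'
os.environ['CUDA_ENABLE_COREDUMP_ON_EXCEPTION'] = '1'
import torch
x = torch.zeros(4096, 8192).bfloat16().cuda()
y = torch.zeros(4096 * 128, 8192).bfloat16()  # CPU tensor (unused below)
# x has only 4096 rows, so both indices are far out of range.
index = torch.tensor([8192 * 8192, 8192 * 8192*2]).cuda().long()
x[index]  # out-of-bounds indexing hits a device-side assert
torch.cuda.synchronize()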
This triggers a `CUDBG_EXCEPTION_WARP_ASSERT` error and starts a CUDA core dump.
However, the CPU process is not aborted after the CUDA core dump.
In addition, the core dump file is sometimes not produced at all. When it is produced, loading it in cuda-gdb 12.9 crashes with:
cuda-gdb/14/gdb/cuda/cuda-state.c:185: internal-error: initialize: Assertion `m_instance.m_num_devices > 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
----- Backtrace -----
0x54234b _Z22gdb_internal_backtracev
0x9932e4 _ZL17internal_vproblemP16internal_problemPKciS2_P13__va_list_tag
0x993508 _Z15internal_verrorPKciS0_P13__va_list_tag
0xb3cac1 _Z18internal_error_locPKciS0_z
0x63a82b _ZN10cuda_state10initializeEv
0x64070a _Z15cuda_initializev
0x6011d8 _Z18cuda_core_load_apiPKc
0x60127d _ZN16cuda_core_targetC1EPKc
0x601803 _ZL21cuda_core_target_openPKci
0x949f62 _ZL11open_targetPKciP16cmd_list_element
0x56ed6f _Z8cmd_funcP16cmd_list_elementPKci
0x95a4be _Z15execute_commandPKci
0x6eee2e _Z15command_handlerPKc
0x6eff8d _Z20command_line_handlerOSt10unique_ptrIcN3gdb13xfree_deleterIcEEE
0x6ef67c _ZL23gdb_rl_callback_handlerPc
0x9d8097 rl_callback_read_char
0x6ee98d _ZL42gdb_rl_callback_read_char_wrapper_noexceptv
0x6ef56d _ZL33gdb_rl_callback_read_char_wrapperPv
0x98e01f _ZL19stdin_event_handleriPv
0xb3d77c _ZL18gdb_wait_for_eventi.part.17
0xb3d902 _Z16gdb_do_one_eventi
0x7d1e56 _ZL21captured_command_loopv
0x7d38a4 _Z8gdb_mainP18captured_main_args
0x44ae64 main
---------------------
cuda-gdb/14/gdb/cuda/cuda-state.c:185: internal-error: initialize: Assertion `m_instance.m_num_devices > 0' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
I can reproduce this on B200 GPUs with driver 575.57.08 and CUDA 12.9. The core dump works as expected on H100 GPUs with driver 570.133.20 and CUDA 12.8.
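
Since the dump file only appears some of the time, here is a rough sketch of a helper for counting how often it shows up and how the process exits. It is only an illustration: `run_once` and the `dump_dir` argument are names I made up, and it assumes the reproduction is saved in a separate script that does not override `CUDA_COREDUMP_FILE` itself.

```python
import glob
import os
import subprocess
import sys


def run_once(cmd, dump_dir):
    """Run cmd once and report (return code, new core dump files).

    Hypothetical helper: the file name pattern mirrors the one used in
    the reproduction above, but points into dump_dir instead of /tmp.
    """
    env = dict(os.environ)
    env["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
    env["CUDA_COREDUMP_FILE"] = os.path.join(dump_dir, "cuda_coredump_%h.%p.%t")

    pattern = os.path.join(dump_dir, "cuda_coredump_*")
    before = set(glob.glob(pattern))
    proc = subprocess.run(cmd, env=env)
    after = set(glob.glob(pattern))

    # On POSIX, a negative return code -N means the child was killed by
    # signal N, so an abort() in the child would show up as -6 (SIGABRT).
    return proc.returncode, sorted(after - before)
```

For example, calling `run_once([sys.executable, "repro.py"], "/tmp")` in a loop (with `repro.py` being a placeholder name for the script above, minus the `CUDA_COREDUMP_FILE` line) would show per run whether a dump file was written and whether the process exited normally or was killed by a signal.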