CUDA coredump file corrupted

Platform: RTX 5090

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
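For context, the full environment setup looks roughly like this (a sketch — the extra variable names are from the cuda-gdb GPU core dump documentation; `CUDA_COREDUMP_SHOW_PROGRESS` is what produces the `coredump: ...` progress lines below, and the run line is a placeholder):

```shell
# Enable GPU coredump generation on a device-side exception
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
# Log per-SM collection progress (the "coredump: ..." lines in this report)
export CUDA_COREDUMP_SHOW_PROGRESS=1
# Optionally pick the output path/name pattern instead of the default
# export CUDA_COREDUMP_FILE=/some/path/core_%h.%p

# ./my_app   # placeholder for the actual failing command
```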

then run my command:

A coredump is generated, but it's corrupted:

cuda-gdb

`target cudacore core_1766111917_9180be2fb883_1125.nvcudmp`

Opening GPU coredump: core_1766111917_9180be2fb883_1125.nvcudmp
Failed to read core file: elfGetSectionHeaderStrTblIdx() failed: Section offset out of ELF image bounds

The coredump file looks too small:

-rw-r--r-- 1 root root 261M Dec 19 10:38 core_1766111917_9180be2fb883_1125.nvcudmp

Tail of the coredump progress log:

[10:43:17.307094] coredump: SM 154/170 is not used by any context
[10:43:17.307101] coredump: SM 155/170 is not used by any context
[10:43:17.307107] coredump: SM 156/170 is not used by any context
[10:43:17.307114] coredump: SM 157/170 is not used by any context
[10:43:17.307120] coredump: SM 158/170 is not used by any context
[10:43:17.307126] coredump: SM 159/170 is not used by any context
[10:43:17.307131] coredump: SM 160/170 is not used by any context
[10:43:17.307136] coredump: SM 161/170 is not used by any context
[10:43:17.307143] coredump: SM 162/170 is not used by any context
[10:43:17.307148] coredump: SM 163/170 is not used by any context
[10:43:17.307153] coredump: SM 164/170 is not used by any context
[10:43:17.307159] coredump: SM 165/170 is not used by any context
[10:43:17.307165] coredump: SM 166/170 is not used by any context
[10:43:17.307171] coredump: SM 167/170 is not used by any context
[10:43:17.307177] coredump: SM 168/170 is not used by any context
[10:43:17.307181] coredump: SM 169/170 is not used by any context
[10:43:17.307185] coredump: SM 170/170 is not used by any context
[10:43:17.307191] coredump: Device 8/8 has finished state collection
[10:43:17.307646] coredump: Calculating ELF file layout
[10:43:17.341004] coredump: ELF file layout calculated
[10:43:17.341016] coredump: Writing ELF file to core_1766112187_9180be2fb883_2117.nvcudmp
[10:43:17.341030] coredump: Current working directory is /mnt/root/workspace/edgeep
[10:43:17.341072] coredump: Writing out global memory (16805299616 bytes)
[10:43:17.526974] coredump: SM 8/170 has finished state collection
[10:43:17.527011] coredump: SM 9/170 has finished state collection
[10:43:17.686841] coredump: SM 10/170 has finished state collection
[10:43:17.686924] coredump: SM 11/170 has finished state collection
[10:43:17.764728] coredump: SM 10/170 has finished state collection
[10:43:17.764745] coredump: SM 11/170 has finished state collection
[10:43:17.829748] coredump: SM 10/170 has finished state collection
[10:43:17.829785] coredump: SM 11/170 has finished state collection
[10:43:17.921470] coredump: SM 10/170 has finished state collection
[10:43:17.921502] coredump: SM 11/170 has finished state collection
[10:43:18.091996] coredump: SM 10/170 has finished state collection
[10:43:18.092026] coredump: SM 11/170 has finished state collection
[10:43:18.125158] coredump: SM 12/170 has finished state collection
[10:43:18.125241] coredump: SM 13/170 has finished state collection
[10:43:18.285244] coredump: SM 12/170 has finished state collection
[10:43:18.285287] coredump: SM 13/170 has finished state collection
[10:43:18.330517] coredump: SM 12/170 has finished state collection
[10:43:18.330548] coredump: SM 13/170 has finished state collection

Hi, @cun.zhang

Sorry for the issue you are hitting.

  1. Have you seen any error or warning messages during the coredump generation?
  2. Is it specific to one sample, or can you reproduce it every time?
  3. Can you provide the sample binary so I can reproduce it?

  1. No errors or warnings during the coredump generation.
  2. It reproduces every time.
  3. I cannot provide you with the binary.

Do you have any suggestions for troubleshooting this problem? For example, is there a system option I should enable?

Hi, @cun.zhang

Looking at your log, it appears the coredump generation halted partway through. Can you make sure the application is not killed before the file is fully written?
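The numbers in the log support that: the dump was supposed to contain about 16.8 GB of global memory alone, yet the file on disk is only 261 MB. A quick back-of-the-envelope check (values taken from the log and `ls` output above):

```python
# The progress log says ~16.8 GB of global memory should be written,
# but the file on disk is only 261 MB -- consistent with a truncated write.
expected_global_mem = 16805299616        # bytes, from the coredump progress log
actual_file_size = 261 * 1024 * 1024     # bytes, "261M" from ls -l
fraction_written = actual_file_size / expected_global_mem
print(f"{fraction_written:.1%} of global memory written")  # -> 1.6%
```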

I'm using PyTorch with a custom torch extension. Could a SIGTERM or SIGSEGV have been sent to the application after the device-side trap but before coredump generation finished?
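If a launcher (e.g. a `torchrun` watchdog or job scheduler) delivers SIGTERM while the dump is still being written, one possible workaround, sketched below rather than an official recommendation, is to ignore SIGTERM in the process so the write can complete. Note that SIGSEGV cannot be safely ignored, so this only covers the SIGTERM case:

```python
import os
import signal

# Sketch: keep an external SIGTERM (e.g. from a job launcher) from killing
# the process while the CUDA coredump is being written. SIGSEGV cannot be
# meaningfully ignored, so a host-side segfault is a different problem.
signal.signal(signal.SIGTERM, signal.SIG_IGN)

# With the handler installed, a SIGTERM delivered to this process is dropped:
os.kill(os.getpid(), signal.SIGTERM)
print("still alive after SIGTERM")
```

If the dump completes once SIGTERM is ignored, that would confirm the launcher was killing the process mid-write.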