Nvcudmp sections

davido · March 8, 2022, 8:02pm

Hello! I have been examining some CUDA core dumps.
Could someone provide more information on some of the ELF sections in the core dump? For example:
[ 1] .cudbg.global.0 LOUSER+0x2 00007ffbfc000000 00000040
000000028af70000 0000000000000000 A 0 0 0
[ 2] .cudbg.global.1 LOUSER+0x2 00007ffec2400000 28af70040
0000000000200000 0000000000000000 A 0 0 0
[ 3] .cudbg.devtbl LOUSER+0x9 0000000000000000 28b170040
0000000000000050 0000000000000050 A 0 0 0
In some test code I found that .cudbg.global.N contains the data for objects allocated with cudaMalloc and filled with cudaMemcpy (though the data seems to be shifted, i.e. 0xdeadbeef becomes 0xbfadde4f when examined from the core dump directly). In this simple test case there is a new .cudbg.global section for each newly allocated object.
However, I am curious why this isn't the case in more complex code, and I am a bit lost deciphering what some of the other sections mean.

When I take a core dump from a PyTorch application where I load a model, perform some operations, etc., there are only two .cudbg.global sections and a bunch of others that look like:
[ 4] .cudbg.ctxtbl.dev LOUSER+0xa 0000000000000000 28b170090
0000000000000028 0000000000000028 A 3 0 0
[ 5] .cudbg.modtbl.dev LOUSER+0x10 0000000000000000 28b1700b8
00000000000008f8 0000000000000008 A 4 0 0
[99] .cudbg.relfimg.de LOUSER+0x7 0000000000000000 28d7b4800
00000000000003b0 0000000000000000 A 5 46 0
[100] .cudbg.elfimg.dev LOUSER+0x6 0000000000000000 28d7b4bb0
000000000001d368 0000000000000000 A 5 47 0
[580] .cudbg.gridtbl.de LOUSER+0xc 0000000000000000 29ddf1550
0000000000000000 0000000000000068 A 3 0 0
[581] .cudbg.smtbl.dev0 LOUSER+0xb 0000000000000000 29ddf1550
00000000000000e0 0000000000000008 A 3 0 0
[582] .cudbg.ctatbl.dev LOUSER+0xd 0000000000000000 29ddf1630
0000000000000000 0000000000000018 A 581 0 0

The above enumerates all the unique sections I could find. I am wondering how they fit into the core dump and how they can be used to recover the original data.
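
For anyone who wants to reproduce the listing above without readelf, here is a minimal Python sketch that walks the ELF64 section headers and prints the sections whose names start with .cudbg. It assumes the core file is a little-endian ELF64 file with fewer than 0xff00 sections, and it relies only on the generic ELF layout, not on anything CUDA-specific. The function names (list_cudbg_sections, describe_type, print_cudbg_sections) are made up for this example and are not part of cuda-gdb or libcudacore.

```python
import mmap
import struct

SHT_LOUSER = 0x80000000  # start of the application-specific section type range in ELF

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, little-endian
SHDR = struct.Struct("<IIQQQQIIQQ")        # ELF64 section header, little-endian


def list_cudbg_sections(data):
    """Return (name, type, addr, offset, size) for every section named .cudbg*.

    `data` can be a bytes object or a read-only mmap of the core file.
    """
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    ehdr = EHDR.unpack_from(data, 0)
    shoff, shentsize, shnum, shstrndx = ehdr[6], ehdr[11], ehdr[12], ehdr[13]
    # Assumption: e_shnum is stored directly (extended numbering is not handled).
    headers = [SHDR.unpack_from(data, shoff + i * shentsize) for i in range(shnum)]
    str_off = headers[shstrndx][4]  # file offset of the section-name string table
    found = []
    for name_idx, sh_type, _flags, addr, offset, size, *_rest in headers:
        start = str_off + name_idx
        end = data.find(b"\0", start)  # names are NUL-terminated
        name = data[start:end].decode("ascii", "replace")
        if name.startswith(".cudbg"):
            found.append((name, sh_type, addr, offset, size))
    return found


def describe_type(sh_type):
    """Format a section type the way readelf does for the user-defined range."""
    if sh_type >= SHT_LOUSER:
        return "LOUSER+0x%x" % (sh_type - SHT_LOUSER)
    return "0x%x" % sh_type


def print_cudbg_sections(path):
    """Map the core file read-only (it can be many GB) and print its .cudbg sections."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for name, sh_type, addr, offset, size in list_cudbg_sections(m):
            print("%-24s %-12s addr=%#x off=%#x size=%#x"
                  % (name, describe_type(sh_type), addr, offset, size))
```

Calling print_cudbg_sections("core_file_path") with the path to a core file prints the full section names, which is handy because readelf truncates them (.cudbg.relfimg.de, .cudbg.gridtbl.de, and so on). What each LOUSER+0xN type actually contains is not something this script can tell you.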

AKravets · March 11, 2022, 11:56am

Hi @davido,
You could start by looking at the cudacoredump.h file (should be part of your CUDA Toolkit/cuda-gdb distribution) which defines various sections and data structures for entries.

davido · March 14, 2022, 5:22pm

Thank you, this was helpful!
Any idea why my data is shifted in the way I described in the original post, though?
"i.e. 0xdeadbeef becomes 0xbfadde4f when examined from the coredump directly"

AKravets · March 15, 2022, 7:30am

Hi @davido,
You can also check the published cuda-gdb sources (e.g. for 11.6u1): https://developer.download.nvidia.com/compute/cuda/opensource/11.6.1/

There you can take a look at the libcudacore library:

/**
 * \file libcudacore.h
 * \brief API for reading CUDA core files.
 *
 * This header file describes the library's interface for opening and
 * reading core files.
 */

If you want to dig deeper into the memory layout in the core file, you can examine the library sources (for example, grep for CUDBG_SHT_GLOBAL_MEM).