Hello! I have been examining some CUDA core dumps.
Is there any way someone could provide more information on some of the ELF sections in the core dump? For example:
[ 1] .cudbg.global.0 LOUSER+0x2 00007ffbfc000000 00000040
000000028af70000 0000000000000000 A 0 0 0
[ 2] .cudbg.global.1 LOUSER+0x2 00007ffec2400000 28af70040
0000000000200000 0000000000000000 A 0 0 0
[ 3] .cudbg.devtbl LOUSER+0x9 0000000000000000 28b170040
0000000000000050 0000000000000050 A 0 0 0
In some test code I found that .cudbg.global.N contains the data for objects allocated with cudaMalloc and filled with cudaMemcpy (though the data seems to be shifted, i.e. 0xdeadbeef becomes 0xbfadde4f when examined directly in the core dump). In this simple test case there is a new .cudbg.global section for each newly allocated object.
However, I am curious why this isn't the case in more complex code, and I am a bit lost deciphering what some of the other sections mean.
In a core dump from a PyTorch application, where I load a model, perform some operations, etc., there are only two .cudbg.global sections and a bunch of others that look like:
[ 4] .cudbg.ctxtbl.dev LOUSER+0xa 0000000000000000 28b170090
0000000000000028 0000000000000028 A 3 0 0
[ 5] .cudbg.modtbl.dev LOUSER+0x10 0000000000000000 28b1700b8
00000000000008f8 0000000000000008 A 4 0 0
[99] .cudbg.relfimg.de LOUSER+0x7 0000000000000000 28d7b4800
00000000000003b0 0000000000000000 A 5 46 0
[100] .cudbg.elfimg.dev LOUSER+0x6 0000000000000000 28d7b4bb0
000000000001d368 0000000000000000 A 5 47 0
[580] .cudbg.gridtbl.de LOUSER+0xc 0000000000000000 29ddf1550
0000000000000000 0000000000000068 A 3 0 0
[581] .cudbg.smtbl.dev0 LOUSER+0xb 0000000000000000 29ddf1550
00000000000000e0 0000000000000008 A 3 0 0
[582] .cudbg.ctatbl.dev LOUSER+0xd 0000000000000000 29ddf1630
0000000000000000 0000000000000018 A 581 0 0
The above enumerates all the unique sections I could find. I am wondering how they fit into the core dump and how they can be used to get back our original data.
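
For reference, here is a minimal Python sketch of how the raw bytes of one of these sections could be pulled out of the file for inspection. It assumes the core dump is an ordinary little-endian ELF64 file with fewer than 0xff00 sections, and it only reads the generic ELF section header table; it does not interpret any of the .cudbg.* table layouts, which I do not know. The helper names read_sections and section_bytes are made up for illustration. Note that readelf prints section types at or above SHT_LOUSER (0x80000000) as LOUSER+n, so LOUSER+0x2 corresponds to sh_type 0x80000002.

```python
import struct

EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # Elf64_Ehdr, little-endian, 64 bytes
SHDR = struct.Struct("<IIQQQQIIQQ")        # Elf64_Shdr, little-endian, 64 bytes


def read_sections(blob):
    """Return a list of (name, sh_type, sh_addr, sh_offset, sh_size) tuples.

    Assumption: little-endian ELF64 with e_shnum and e_shstrndx stored
    directly in the ELF header (no extended section numbering).
    """
    ident = blob[:16]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")
    fields = EHDR.unpack_from(blob, 0)
    e_shoff, e_shentsize, e_shnum, e_shstrndx = (
        fields[6], fields[11], fields[12], fields[13])
    headers = [SHDR.unpack_from(blob, e_shoff + i * e_shentsize)
               for i in range(e_shnum)]
    # The section name string table is itself a section.
    strtab_off, strtab_size = headers[e_shstrndx][4], headers[e_shstrndx][5]
    strtab = blob[strtab_off:strtab_off + strtab_size]
    sections = []
    for sh_name, sh_type, _flags, sh_addr, sh_offset, sh_size, *_ in headers:
        end = strtab.index(b"\0", sh_name)  # names are NUL-terminated
        name = strtab[sh_name:end].decode()
        sections.append((name, sh_type, sh_addr, sh_offset, sh_size))
    return sections


def section_bytes(blob, sections, name):
    """Return the raw file contents of the first section called `name`."""
    for sec_name, _type, _addr, offset, size in sections:
        if sec_name == name:
            return bytes(blob[offset:offset + size])
    raise KeyError(name)
```

Since these dumps can be many gigabytes, it is probably better to pass an mmap of the file (mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)) than to read everything into memory; the functions only rely on slicing and struct unpacking, which mmap objects support. Comparing the result of section_bytes for a .cudbg.global.N section against struct.pack("<I", 0xdeadbeef) would show whether the value is stored byte-for-byte as written.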