How to use cuda-gdb core dumps

I want to enable cuda-gdb core dumps on exception.

From reading the documentation, I tried to set CUDA_ENABLE_COREDUMP_ON_EXCEPTION to 1 by typing this in the terminal, outside of cuda-gdb:

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1

Then I opened the program in cuda-gdb and ran it. It hit a SIGINT caused by assert(false). However, there is no message about any core dump being created. Also, I don't know where the file would be located if it were created.

  1. Is the way I enabled the core dump correct?
  2. Is SIGINT not supposed to generate a core dump?
  3. Where would the file be, and how can I change the default path of the file?

Conceptually, the process is similar at a high level to how you would use an "ordinary" CPU core dump:

  1. You enable the core dump.
  2. You run your program normally (not under cuda-gdb).
  3. Your program hits some kind of fault that triggers a core dump (and exits, depositing a core dump file on disk).
  4. You then start cuda-gdb.
  5. You don't open your program itself; instead you open the core dump file.

https://docs.nvidia.com/cuda/cuda-gdb/index.html#gpu-coredump
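As a rough sketch (the program name here is just a placeholder, and the dump file name will differ on your machine), the whole flow looks something like:

$ export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
$ ./my_app                  # run normally; on a device exception a core_*.nvcudmp file is written
$ cuda-gdb ./my_app
(cuda-gdb) target cudacore core_<timestamp>_<hostname>_<pid>.nvcudmp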

I got the program to produce a core dump, but I can't locate the dumped file.

I have read the link you attached thoroughly, but there is still no mention of:

  1. the default path of core dump files
  2. how to change the default path

How do you know you got the program to produce a core dump if you can't locate the dumped file?

Anyway, none of this seems obscure. Here's a full test case:

$ export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
$ cat t365.cu
__global__ void k(int *d){

  int *x = NULL;
  *d = *x;
}

int main(){

  int *data;
  cudaMalloc(&data, sizeof(int));
  k<<<1,1>>>(data);
  cudaDeviceSynchronize();
}
$ nvcc -o t365 t365.cu
$ ls core*
ls: cannot access core*: No such file or directory
$ ./t365

Message from syslogd@dc11 at Dec 31 01:31:38 ...
 kernel:NMI watchdog: BUG: soft lockup - CPU#18 stuck for 23s! [t365:7703]
Aborted (core dumped)
$ ls core*
core_1546237863_dc11.dc.nvidia.com_7688.nvcudmp
$

I see instructions in the documentation for changing the name of the core dump file. The dump is written to the same path your executable uses. I don't see instructions for changing the default core dump path to something other than the path of your executable.

This seems very straightforward to me. I’m not sure what the issue is.
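For what it's worth, if I'm reading the docs correctly, the name of the dump file can be changed with the CUDA_COREDUMP_FILE environment variable, which takes a template where %h, %p, and %t are replaced by the hostname, the pid, and a timestamp. A minimal sketch (the template string here is just an example):

$ export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
$ export CUDA_COREDUMP_FILE="gpu_core.%h.%p"

Whether putting a directory prefix in that template redirects the dump somewhere other than the default location is something I haven't verified.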

Hi,

Here are my steps:

my2:~/test> export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
my2:~/test> cat t365.cu

__global__ void k(int *d){

  int *x = NULL;
  *d = *x;
}

int main(){

  int *data;
  cudaMalloc(&data, sizeof(int));
  k<<<1,1>>>(data);
  cudaDeviceSynchronize();
}
my2:~/test> nvcc -o t365 t365.cu
my2:~/test> ls core*
ls: cannot access 'core*': No such file or directory
my2:~/test> ./t365
my2:~/test> ls core*
ls: cannot access 'core*': No such file or directory
my2:~/test> sudo tail -n 10 /var/log/messages | grep "NVRM"
2021-03-01T19:31:51.946564+03:00 my2 kernel: [ 2325.089234] NVRM: Xid (PCI:0000:01:00): 43, pid=7830, Ch 00000068

So I have an NVRM message in the log file, but I don't have a core dump file and I don't get any messages when I run the program.

How do I enable core dump generation?
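Just a sanity check (this is a guess, not a diagnosis): make sure the variable is actually exported in the same shell that launches the binary, for example:

my2:~/test> env | grep CUDA_ENABLE_COREDUMP_ON_EXCEPTION
CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
my2:~/test> CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 ./t365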

When I use cuda-gdb to open the core dump file with the .nvcudmp suffix, I get the following error:
xxx is not a core dump: file format not recognized

I also failed to open it using nsys, ncu, or VS Code.
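If I'm reading the cuda-gdb documentation correctly, a .nvcudmp file is not a regular core file, which is presumably why it is rejected when passed to cuda-gdb (or gdb) as an ordinary core dump, and nsys/ncu don't read it either. The GPU dump is loaded from inside cuda-gdb with the target cudacore command, roughly like this (file name taken from the transcript earlier in this thread; yours will differ):

$ cuda-gdb ./t365
(cuda-gdb) target cudacore core_1546237863_dc11.dc.nvidia.com_7688.nvcudmp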