Manually take Memory Dump?

Is there any way currently to generate a snapshot dump manually for a given process using CUDA, without intentionally causing an exception?

Hi @davido
There is a mechanism to trigger coredump generation manually:

Enabling user induced GPU core dump generation

For the devices that support compute preemption, the user can interrupt a running CUDA process to generate the GPU core dump.

Set the CUDA_ENABLE_USER_TRIGGERED_COREDUMP environment variable to 1 in order to enable generating a user induced GPU core dump. This option is disabled by default. Setting this environment variable will open a communication pipe for each subsequently running CUDA process. To induce the GPU core dump, the user simply writes to the pipe.

To change the default pipe file name, set the CUDA_COREDUMP_PIPE environment variable to a specific pipe name. The default pipe name is in the following format: corepipe.cuda.HOSTNAME.PID, where HOSTNAME is the host name of the machine running the CUDA application and PID is the process identifier of the CUDA application. This environment variable can take % specifiers as described in the above section.

More details can be found in cuda-gdb documentation: CUDA-GDB :: CUDA Toolkit Documentation (search for CUDA_ENABLE_USER_TRIGGERED_COREDUMP). To trigger the coredump you need to write anything to the pipe (not just open and close it).

Note that the coredump generation operation is destructive, so your app will be killed in the process. We don't support taking snapshots while the application is running yet.

Thank you for your response! Is there any way you can elaborate on this?
I have the CUDA_ENABLE_USER_TRIGGERED_COREDUMP environment variable set, both via export CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1 and by adding that variable definition to the runtime configuration of the Python process I am trying to debug. I attached to that Python process by PID in cuda-gdb, but there seems to be no corepipe in my working directory (i.e. corepipe.cuda.HOSTNAME.PID) and I can't write to that pipe.
An example would be greatly appreciated.

Hi @davido,
The following sequence worked for me:

  • Run the CUDA application (I used the radixSortThrust CUDA sample, modified to run in a while (true) loop).
  • The pipe file is created in the current directory:
    % ls
    corepipe_akravets-dt_81706
  • Write 1 to that pipe: echo 1 > corepipe_akravets-dt_81706
  • Note that the CUDA process terminates and the dump file appears:
    Sorting 1048576 32-bit unsigned int keys and values

    zsh: abort (core dumped)  CUDA_ENABLE_USER_TRIGGERED_COREDUMP=1 ./radixSortThrust
    % ls
    core_1636543265_akravets-dt_81706.nvcudmp

Thanks! This helped me get farther than last time; however, I am still unable to reproduce what is happening here.

Starting any program, I do see the core pipe getting created now, but writing to the pipe seems to do nothing. I have tried a few approaches, including:

  1. Running a program with a while (true) loop thrown in, then writing to the pipe via echo 1 > corepipe_david.
  2. Launching a program with cuda-gdb, printing out the address of some GPU data structure pointer (e.g. 0xb012345678), then letting the program continue until it is caught in a while (true) loop, then writing echo 1 > corepipe_david.
  3. Running the radixSortThrust program with a while (true) loop inside of it (like in your test case), and writing echo 1 > corepipe_david while it is caught in the loop.

None of these worked on my setup, though getting to the point where the corepipe is created does work now (only after changing the display settings in the BIOS so that a display is not automatically assigned to a device).

Any ideas? Ultimately I would like to be able to inspect a data structure I moved over to the GPU, so this would be useful to figure out.

Hi @davido
Could you provide a bit more info about your setup?

  • The output of the nvidia-smi command.
  • Did you add the while(true) loop on host (so kernel is launched in a loop) or on GPU (so there is an infinite loop inside the kernel)?
  • If on host - are you running any kernels inside the loop body?

In my case, I have updated radixSortThrust as follows, wrapping the test calls in an infinite loop:

        while (true)
        {
            if (checkCmdLineFlag(argc, (const char **)argv, "float"))
                bTestResult = testSort<float, true>(argc, argv);
            else
                bTestResult = testSort<unsigned int, false>(argc, argv);
        }

The while true loop was added in my test program on the host in this way:

	while (true)
	{
		// Converting image to grayscale...
		convertToGrayScale<<<dimGrid, dimBlock>>>(dev_pOut, dev_pIn, dev_consts, width, height, channels);
	}

Where convertToGrayScale is defined as:

__global__ void convertToGrayScale(float *pOut, float *pIn, float *consts, int width, int height, int channels)

My program works if I just let it run normally; I am trying to use this as a simple case to inspect a data structure I moved into GPU memory.

My nvidia-smi output is shown below, though I am actually using CUDA 10.1, as it is necessary for my actual project.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro M2000        Off  | 00000000:01:00.0 Off |                  N/A |
| 56%   47C    P0    24W /  75W |    176MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M2000        Off  | 00000000:05:00.0 Off |                  N/A |
| 56%   35C    P8     8W /  75W |      9MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1110      G   /usr/lib/xorg/Xorg                 32MiB |
|    0   N/A  N/A      1806      G   /usr/lib/xorg/Xorg                 50MiB |
|    0   N/A  N/A      2011      G   /usr/bin/gnome-shell               81MiB |
|    1   N/A  N/A      1110      G   /usr/lib/xorg/Xorg                  2MiB |
|    1   N/A  N/A      1806      G   /usr/lib/xorg/Xorg                  2MiB |
+-----------------------------------------------------------------------------+
That is my nvidia-smi output, though I am doing all of this through a TTY on Ubuntu, since otherwise Xorg crashes my display whenever I use cuda-gdb on anything. My cuda-gdb version is:

10.1 release
Portions Copyright (C) 2007-2019 NVIDIA Corporation
GNU gdb (GDB) 7.12

in case that helps, and my nvcc version is

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Thank you for helping so far!

I am not sure how CUDA actually catches the write to the pipe and generates that core dump, but should intentionally causing a bug within my CUDA kernel also result in a core dump being made? Maybe it's an issue with communication through the pipe?

Hi @davido,
The nvidia-smi output points to the r495 driver (CUDA 11.5), but you mentioned that you are actually using CUDA 10.1 - can you provide more details on that? Are you using libraries from the CUDA 10.1 installation (e.g. by setting LD_LIBRARY_PATH to the CUDA 10.1 location)?

Could you also run deviceQuery sample using your CUDA 10.1 setup and share the output?

Some context - there are multiple components involved here:

  • Kernel driver (detected as r495 by nvidia-smi).
  • User-mode CUDA libraries - depending on your setup, these can be either from CUDA 11.5 or from CUDA 10.1; this needs to be confirmed.
  • Tools (compiler, cuda-gdb) - the tools from CUDA 10.1 are used.