CUDA coredumps not being generated

I’m trying to generate .nvcudmp core dump files on a Linux x86_64 target.
I’ve done the following:

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_SHOW_PROGRESS=1

before running the process. When I do, I get the following error:

WARNING: NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
Kernel WARNING: NVRM: GPU Board Serial Number: 1324822041760
Kernel WARNING: NVRM: Xid (PCI:0000:25:00): 43, pid=937, name=DebugTask, Ch 00000008

After this, the process hangs and I have to close it manually.

Hi @jonathan.cameron,
Thank you for reporting the issue. Could you share some additional information to help us identify the exact issue?

  • Please share the output of the nvidia-smi command.
  • If possible, could you share any new dmesg output after running the process?
Wed Aug  6 13:54:59 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:25:00.0 Off |                    0 |
| N/A   34C    P0             26W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:81:00.0 Off |                    0 |
| N/A   34C    P0             38W /  250W |       0MiB /  40960MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

dmesg|tail
[  128.333631] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  128.483879] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  131.459728] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  131.461121] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  147.204642] Application starting, PID 945
[  168.293453] nvidia 0000:25:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_ga10x.bin
[  169.656609] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_tu10x.bin
[  171.143169] NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
[  171.143174] NVRM: GPU Board Serial Number: 1324822041760
[  171.143175] NVRM: Xid (PCI:0000:25:00): 43, pid=945, name=DebugTask, Ch 00000008

Thanks!

Does your app crash (in the GPU code) if you don’t provide the environment variables?

Yes. It is a deliberate null pointer dereference, used primarily for testing this core dump functionality. The kernel is called through a wrapper written in C; this is the code that causes the exception:

__global__ void lib_cuda_CrashKernel() {
    char *p = NULL;
    *p = NULL;  // deliberate null-pointer dereference
}
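For context, this is roughly what the host side of such a C wrapper might look like. This is a minimal sketch of my own, not the actual wrapper code: the entry-point name `launch_crash_kernel` and the launch configuration are assumptions. It shows how the device-side exception only surfaces on the host once it synchronizes:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lib_cuda_CrashKernel() {
    char *p = NULL;
    *p = NULL;  // deliberate null-pointer dereference
}

// Hypothetical C-callable wrapper: launches the kernel and reports the
// device-side exception via the runtime error code.
extern "C" int launch_crash_kernel(void) {
    lib_cuda_CrashKernel<<<1, 1>>>();
    // The fault is only reported to the host at a synchronization point.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return 0;
}
```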

Do you have a cudaDeviceSynchronize call in the host code? Can you check whether you can generate a coredump with the following application?

#include <stdio.h>

__global__ void lib_cuda_CrashKernel() {
    char *p = NULL;
    *p = NULL;
}

int main(int argc, char **argv) {
    printf("Running\n");
    lib_cuda_CrashKernel<<<32, 1, 1>>>();
    cudaDeviceSynchronize();
    printf("Done\n");
}

We see the following:

% nvcc -g -G test.cu 
% CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 ./a.out 
Running
Starting GPU coredump generation, set the CUDA_COREDUMP_SHOW_PROGRESS environment variable to 1 to enable more detailed output
Aborted (core dumped)

I’ve tried again with the synchronise call in the host code, but to no avail. I’ve also tried compiling with the debug flags you used. Are those strictly necessary for generating the .nvcudmp file?

For me, the error messages are the same:

WARNING: NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
Kernel WARNING: NVRM: GPU Board Serial Number: 1324822041760
Kernel WARNING: NVRM: Xid (PCI:0000:25:00): 43, pid=6413, name=DebugTask, Ch 00000008

dmesg|tail
[  859.011437] [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
[  859.011438] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 1
[  859.012936] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  859.013650] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  865.077209] Application starting, PID 6413
[  904.966529] nvidia 0000:25:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_ga10x.bin
[  906.328939] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_tu10x.bin
[  907.830838] NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
[  907.830843] NVRM: GPU Board Serial Number: 1324822041760
[  907.830843] NVRM: Xid (PCI:0000:25:00): 43, pid=6413, name=DebugTask, Ch 00000008

A quick search for Xid 43 turns up:

6.4. Xid 43: Reset Channel Verif Error
This event is logged when a user application hits a software induced fault and must terminate. The
GPU remains in a healthy state.
In most cases, this is not indicative of a driver bug but rather a user application error.

Obviously we can agree with the last statement (user error). Is the driver getting in the way of the coredump being produced?

We will need to investigate the driver part of the coredump generation. Could you provide us with the following information:

  • nvidia-smi -q
  • NVIDIA bug report. You can generate it by running the nvidia-bug-report.sh script as root. The script should be available as part of your CUDA installation (it should be present in PATH).

It will take us some time to investigate the issue; we will update this post as soon as we have something.

nvidia_smi_q.txt (22.0 KB)

nvidia-bug-report.log.gz (793.1 KB)

Hi, I’ve attached both pieces of information. Thanks.

Hi @AKravets, have you been able to make any progress investigating this?

Thanks

Hello!
We believe that it might be an issue in our stack. The fix should be available in one of the upcoming CUDA versions. I will update this post as soon as the fixed CUDA version is released.

Hi @AKravets. A colleague asked me to add libcudadebugger to the target system as part of a separate bit of work. I found that this helped the system generate a CUDA coredump.

For the record, and for anyone who finds this in the future: the system I am using is embedded, so I have been picking the smallest set of libraries needed and loading them individually, instead of installing the entire CUDA toolkit, which would use a lot of space.
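On a minimal system like this, a quick way to confirm whether the debugger library is actually visible to the dynamic linker is the check below. This is my own addition, not from NVIDIA docs:

```shell
# Check whether libcudadebugger is visible to the dynamic linker; without it,
# GPU coredump generation silently fails on the driver side.
ldconfig -p 2>/dev/null | grep libcudadebugger \
    || echo "libcudadebugger not found -- install the libcudadebugger package"
```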

So, adding libcudadebugger1_575.51.03-1_amd64.deb to the system fixed it. The environment variables I exported were:

export CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION=1
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_SHOW_PROGRESS=1
export CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1
export CUDA_COREDUMP_GENERATION_FLAGS=skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory
export CUDA_COREDUMP_FILE=/tmp/tm500.nvcudmp

resulting in:

user> ls -alF /tmp/tm500.nvcudmp
-rw-r--r--    1 root     0         31928721 Sep 17 13:18 /tmp/tm500.nvcudmp
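For completeness, the resulting dump can be inspected with cuda-gdb via the documented `target cudacore` command. The paths below are from my setup; the `-ex` invocation is a sketch assuming cuda-gdb accepts gdb-style command-line options:

```shell
# Load the GPU coredump into cuda-gdb. Equivalently, start cuda-gdb and run:
#   (cuda-gdb) target cudacore /tmp/tm500.nvcudmp
# then use commands such as `info cuda kernels` and `bt` to inspect the fault.
cuda-gdb -ex "target cudacore /tmp/tm500.nvcudmp"
```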

Cheers.

Hi @jonathan.cameron,
Thank you for the update!

Yes, having the libcudadebugger package present is a requirement for GPU coredump generation. Since GPU coredumps can be generated with this library on the system, can I mark the topic as resolved?

Yes, I think this can be marked as resolved.

Thanks.

Thank you! Marking the topic as resolved.
