CUDA coredumps not being generated

I’m trying to generate .nvcudmp core dump files on a Linux x86_64 target.
I’ve done the following:

export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_SHOW_PROGRESS=1

before running the process. When I do, I get the following error:

WARNING: NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
Kernel WARNING: NVRM: GPU Board Serial Number: 1324822041760
Kernel WARNING: NVRM: Xid (PCI:0000:25:00): 43, pid=937, name=DebugTask, Ch 00000008

After this, the process hangs and I have to close it manually.

Hi @jonathan.cameron,
Thank you for reporting the issue. Could you share some additional information to help us identify the exact issue?

  • Please share the output of the nvidia-smi command.
  • If possible, could you share any new dmesg output after running the process?
Wed Aug  6 13:54:59 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:25:00.0 Off |                    0 |
| N/A   34C    P0             26W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:81:00.0 Off |                    0 |
| N/A   34C    P0             38W /  250W |       0MiB /  40960MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

dmesg|tail
[  128.333631] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  128.483879] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  131.459728] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  131.461121] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  147.204642] Application starting, PID 945
[  168.293453] nvidia 0000:25:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_ga10x.bin
[  169.656609] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_tu10x.bin
[  171.143169] NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
[  171.143174] NVRM: GPU Board Serial Number: 1324822041760
[  171.143175] NVRM: Xid (PCI:0000:25:00): 43, pid=945, name=DebugTask, Ch 00000008

Thanks!

Does your app crash (in the GPU code) if you don’t provide the environment variables?

Yes. It is a deliberate null pointer dereference, used primarily for testing this core dump functionality. The kernel is called through a wrapper written in C; this is the code that causes the exception:

__global__ void lib_cuda_CrashKernel() {
    char *p = NULL;
    *p = NULL;  // deliberate null-pointer dereference
}
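For context, this is roughly what the host side of such a C wrapper might look like. This is a minimal sketch of my own, not the actual wrapper code: the entry-point name `launch_crash_kernel` and the launch configuration are assumptions. It shows how the device-side exception only surfaces on the host once it synchronizes:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void lib_cuda_CrashKernel() {
    char *p = NULL;
    *p = NULL;  // deliberate null-pointer dereference
}

// Hypothetical C-callable wrapper: launches the kernel and reports the
// device-side exception via the runtime error code.
extern "C" int launch_crash_kernel(void) {
    lib_cuda_CrashKernel<<<1, 1>>>();
    // The fault is only reported to the host at a synchronization point.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return 0;
}
```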

Do you have a cudaDeviceSynchronize call in the host code? Can you check whether you can generate a coredump with the following application?

#include <stdio.h>

__global__ void lib_cuda_CrashKernel() {
    char *p = NULL;
    *p = NULL;
}

int main(int argc, char **argv) {
    printf("Running\n");
    lib_cuda_CrashKernel<<<32, 1, 1>>>();
    cudaDeviceSynchronize();
    printf("Done\n");
}

We see the following:

% nvcc -g -G test.cu 
% CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 ./a.out 
Running
Starting GPU coredump generation, set the CUDA_COREDUMP_SHOW_PROGRESS environment variable to 1 to enable more detailed output
Aborted (core dumped)

I’ve tried again with the synchronise call in the host code, but to no avail. I’ve also tried compiling with the debug flags you used. Are those strictly necessary for generating the .nvcudmp file?

For me, the error messages are the same:

WARNING: NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
Kernel WARNING: NVRM: GPU Board Serial Number: 1324822041760
Kernel WARNING: NVRM: Xid (PCI:0000:25:00): 43, pid=6413, name=DebugTask, Ch 00000008

dmesg|tail
[  859.011437] [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
[  859.011438] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 1
[  859.012936] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  859.013650] EXT4-fs (sda1): re-mounted. Quota mode: none.
[  865.077209] Application starting, PID 6413
[  904.966529] nvidia 0000:25:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_ga10x.bin
[  906.328939] nvidia 0000:81:00.0: firmware: direct-loading firmware nvidia/575.51.03/gsp_tu10x.bin
[  907.830838] NVRM: GPU at PCI:0000:25:00: GPU-821d1cf4-6282-ee29-b730-079c723aa071
[  907.830843] NVRM: GPU Board Serial Number: 1324822041760
[  907.830843] NVRM: Xid (PCI:0000:25:00): 43, pid=6413, name=DebugTask, Ch 00000008

A quick search for Xid 43 turns up:

6.4. Xid 43: Reset Channel Verif Error
This event is logged when a user application hits a software induced fault and must terminate. The
GPU remains in a healthy state.
In most cases, this is not indicative of a driver bug but rather a user application error.

Obviously we can agree with the last statement (user error). Is the driver getting in the way of the coredump being produced?

We will need to investigate the driver part of the coredump generation. Could you provide us with the following information:

  • nvidia-smi -q
  • NVIDIA bug report. You can generate it by running the nvidia-bug-report.sh script as root. The script should be available as part of your CUDA installation (it should be present in PATH).

It will take us some time to investigate the issue; we will update this post as soon as we have something.

nvidia_smi_q.txt (22.0 KB)

nvidia-bug-report.log.gz (793.1 KB)

Hi, I’ve attached both pieces of information. Thanks.

Hi @AKravets, have you been able to make any progress investigating this?

Thanks

Hello!
We believe that it might be an issue in our stack. The fix should be available in one of the upcoming CUDA versions. I will update this post as soon as the fixed CUDA version is released.

Hi @AKravets. A colleague asked me to add libcudadebugger to the target system as part of a separate bit of work. I found that this helped the system generate a CUDA coredump.

For the record, and for anyone who finds this in the future: the system I am using is embedded, so I have been picking the smallest set of libraries needed and loading them individually, instead of installing the entire CUDA toolkit, which would use a lot of space.
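On a minimal system like this, a quick way to confirm whether the debugger library is actually visible to the dynamic linker is the check below. This is my own addition, not from NVIDIA docs:

```shell
# Check whether libcudadebugger is visible to the dynamic linker; without it,
# GPU coredump generation silently fails on the driver side.
ldconfig -p 2>/dev/null | grep libcudadebugger \
    || echo "libcudadebugger not found -- install the libcudadebugger package"
```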

So, adding libcudadebugger1_575.51.03-1_amd64.deb to the system fixed it. The environment variables I exported were:

export CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION=1
export CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
export CUDA_COREDUMP_SHOW_PROGRESS=1
export CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1
export CUDA_COREDUMP_GENERATION_FLAGS=skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory
export CUDA_COREDUMP_FILE=/tmp/tm500.nvcudmp

resulting in:

user> ls -alF /tmp/tm500.nvcudmp
-rw-r--r--    1 root     0         31928721 Sep 17 13:18 /tmp/tm500.nvcudmp
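For completeness, the resulting dump can be inspected with cuda-gdb via the documented `target cudacore` command. The paths below are from my setup; the `-ex` invocation is a sketch assuming cuda-gdb accepts gdb-style command-line options:

```shell
# Load the GPU coredump into cuda-gdb. Equivalently, start cuda-gdb and run:
#   (cuda-gdb) target cudacore /tmp/tm500.nvcudmp
# then use commands such as `info cuda kernels` and `bt` to inspect the fault.
cuda-gdb -ex "target cudacore /tmp/tm500.nvcudmp"
```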

Cheers.

Hi @jonathan.cameron,
Thank you for the update!

Yes, having the libcudadebugger package present is a requirement for GPU coredump generation. Since GPU coredumps can be generated with this library on the system, can I mark the topic as resolved?

Yes, I think this can be marked as resolved.

Thanks.

Thank you! Marking the topic as resolved.
