Cuda-gdb segmentation fault when stepping through program

I am encountering a crash in cuda-gdb that makes it impossible for me to debug my CUDA kernels. Consider the following C++ code (

#include <cuda_runtime.h>
#include <cuda.h>

template <typename T>
class cuda_buffer_view
	__device__ __host__ cuda_buffer_view();

	T *buf_;

template <typename T>
__device__ __host__ cuda_buffer_view<T>::cuda_buffer_view() : buf_{nullptr}

int main()
	struct task_package
		cuda_buffer_view<task_package> owner{};

	char *a;
	cudaMallocHost(&a, 1, 0u);


	return 0;

Now compile it with the following command:

nvcc -ccbin g++ -m64 -g -G -o segfault -lcuda -lculibos

Now, run

cuda-gdb ./segfault

Then enter


Doing this, I get the following output:

NVIDIA (R) CUDA Debugger
12.0 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
--Type <RET> for more, q to quit, c to continue without paging--
Find the GDB manual and other documentation resources online at:

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./segfault...
(cuda-gdb) break
Breakpoint 1 at 0xac04: file, line 27.
(cuda-gdb) run
Starting program: /home/user/Desktop/segfault
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/".

Breakpoint 1, main () at
27              cudaMallocHost(&a, 1, 0u);
(cuda-gdb) n
[New Thread 0x7ffff5475000 (LWP 132589)]
[Detaching after fork from child process 132590]
[New Thread 0x7ffff4948000 (LWP 132599)]
[New Thread 0x7fffe8a14000 (LWP 132600)]
[New Thread 0x7fffe3d9e000 (LWP 132601)]
Segmentation fault (core dumped)

System configuration:

  • OS: Ubuntu 20.04.5 LTS
  • CUDA version: 12.0, cuda-gdb 12.1
  • GPU: RTX A5000

Some things that “fix” the issue:

  • Setting the breakpoint at the entry point of main and stepping through doesn’t segfault in the minimal example above
  • Removing __device__ __host__ from the constructor
  • Using a type other that task_package for the template parameter in line 24/29
  • Merging the constructor definition & declaration of cuda_buffer_view
  • Removing the member from cuda_buffer_view

Is this a known issue? What exactly is causing this? Unfortunately, the segfault is not quite as “fragile” in my full application, so I can’t just delete some lines without understanding the parts that are causing the issue(s).
Please let me know if you need any more information.

Hi @LukasLang,

Thanks for reaching out! This is a known issue and we have a fix in for the upcoming CUDA Toolkit release. Apologies that we didn’t catch this issue in the CUDA Toolkit 12.0 release.

The issue here was due to a change in upstream gdb 12.1 (which we upgraded to with the CUDA Toolkit 12.0 release) behavior exercising a long time bug in cuda-gdb that wasn’t discovered until recently. This bug would only present itself when encountering self referential types in the DWARF.

With the next release you will see expected behavior in your segfault program:

(cuda-gdb) n
[New Thread 0x7ffff2931000 (LWP 31852)]
[Detaching after fork from child process 31853]
[New Thread 0x7ffff1e71000 (LWP 31863)]
[New Thread 0x7ffff118f000 (LWP 31864)]
[New Thread 0x7fffe5fff000 (LWP 31865)]
29		cuda_buffer_view<task_package>{};

In the meantime, you can try using the cuda-gdb from the CUDA Toolkit 11.8 release or use the workaround you described.

Thank you very much for the prompt reply! I am looking forward to a version with a fix.

Thanks to you confirming that the issue is indeed with self-referential types, I have been able to implement a workaround. In case anyone else stumbles across this issue, here’s what I did: I added a new class template opaque<T> that tries to behave like T, but stores the instance of T in a char array:

#pragma once

#include <new>
#include <type_traits>

template <typename T>
struct opaque
	using store_t = std::conditional_t<std::is_reference_v<T>, std::reference_wrapper<std::remove_reference_t<T>>, T>;

	char buf[sizeof(store_t)];

	template <typename... A>
	opaque(A &&...args)
		new (buf) store_t{std::forward<A>(args)...};
	opaque &operator=(const T &val)
		*(store_t *)buf = val;
		return *this;
	template <typename T2 = T>
	opaque &operator=(T &&val)
		*(store_t *)buf = std::move(val);
		return *this;

		((store_t *)buf)->~store_t();

	operator T()
		return *(store_t *)buf;

	template <typename T2 = T, std::enable_if_t<!std::is_reference_v<T2>, bool> = true>
	store_t *operator->()
		return (store_t *)buf;

	T operator*()
		return *(store_t *)buf;

The task_package from my orginal post would then be

struct task_package
	opaque<cuda_buffer_view<task_package>> owner{};

This way, the type is no longer self-referential in a way that crashes cuda-gdb, while still behaving mostly like the original type. The only changes required are that member variables need to acessed as foo->bar instead of, and conversion of opaque<T> into T sometimes needs to be forced with *foo. - It’s not a perfect fix, but at least it allows me to use cuda-gdb again until the fix is released

Thank you! I have also been struggling with this issue for the past week, and only just got around to reducing. It was indeed with a self-referential structure with a pointer-to-function and a function argument pointing to a struct that holds it.

Using cuda-gdb from 11.8 works, and using a void * instead of the self-reference pointer in the (below) HitFunction interface also works.

I include my reduced example here for completeness. The two-step reassignment through the union appears to be necessary for cuda-gdb to try to read the type information, and the SEGV happens on the synchronise, although I believe any runtime API call will do (it was originally a malloc) . Compile with: nvcc -G -o debug && cuda-gdb ./debug -ex "b hit_sphere" -ex "r":

#include <cstdio>

union RayObjects;

using HitFunction = void (*)(const RayObjects *hittable);

struct RayObject_Sphere {
    HitFunction hit;

union RayObjects {
    RayObject_Sphere sphere;

__device__ void hit_sphere(const RayObjects *hittable) {
    const RayObject_Sphere *obj = &hittable->sphere;

int main(void) {
    printf("This line prints\n");
    printf("This line doesn't\n");

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.