Cuda-gdb segmentation fault when stepping through program

LukasLang · February 19, 2023, 10:37am

I am encountering a crash in cuda-gdb that makes it impossible for me to debug my CUDA kernels. Consider the following C++ code (segfault.cu):

#include <cuda_runtime.h>
#include <cuda.h>

template <typename T>
class cuda_buffer_view
{
public:
	__device__ __host__ cuda_buffer_view();

protected:
	T *buf_;
};

template <typename T>
__device__ __host__ cuda_buffer_view<T>::cuda_buffer_view() : buf_{nullptr}
{
}

int main()
{
	struct task_package
	{
		cuda_buffer_view<task_package> owner{};
	};

	char *a;
	cudaMallocHost(&a, 1, 0u);

	cuda_buffer_view<task_package>{};

	return 0;
}

Now compile it with the following command:

nvcc -ccbin g++ -m64 -g -G -o segfault segfault.cu -lcuda -lculibos

Now, run

cuda-gdb ./segfault

Then enter

break segfault.cu:27
run
n

Doing this, I get the following output:

NVIDIA (R) CUDA Debugger
12.0 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
--Type <RET> for more, q to quit, c to continue without paging--
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./segfault...
(cuda-gdb) break segfault.cu:27
Breakpoint 1 at 0xac04: file segfault.cu, line 27.
(cuda-gdb) run
Starting program: /home/user/Desktop/segfault
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, main () at segfault.cu:27
27              cudaMallocHost(&a, 1, 0u);
(cuda-gdb) n
[New Thread 0x7ffff5475000 (LWP 132589)]
[Detaching after fork from child process 132590]
[New Thread 0x7ffff4948000 (LWP 132599)]
[New Thread 0x7fffe8a14000 (LWP 132600)]
[New Thread 0x7fffe3d9e000 (LWP 132601)]
Segmentation fault (core dumped)

System configuration:

OS: Ubuntu 20.04.5 LTS
CUDA version: 12.0, cuda-gdb 12.1
GPU: RTX A5000

Some things that “fix” the issue:

Setting the breakpoint at the entry point of main and stepping through doesn’t segfault in the minimal example above
Removing __device__ __host__ from the constructor
Using a type other that task_package for the template parameter in line 24/29
Merging the constructor definition & declaration of cuda_buffer_view
Removing the member from cuda_buffer_view

Is this a known issue? What exactly is causing this? Unfortunately, the segfault is not quite as “fragile” in my full application, so I can’t just delete some lines without understanding the parts that are causing the issue(s).
Please let me know if you need any more information.

agontarek · February 20, 2023, 1:17am

Hi @LukasLang,

Thanks for reaching out! This is a known issue and we have a fix in for the upcoming CUDA Toolkit release. Apologies that we didn’t catch this issue in the CUDA Toolkit 12.0 release.

The issue here was due to a change in upstream gdb 12.1 (which we upgraded to with the CUDA Toolkit 12.0 release) behavior exercising a long time bug in cuda-gdb that wasn’t discovered until recently. This bug would only present itself when encountering self referential types in the DWARF.

With the next release you will see expected behavior in your segfault program:

(cuda-gdb) n
[New Thread 0x7ffff2931000 (LWP 31852)]
[Detaching after fork from child process 31853]
[New Thread 0x7ffff1e71000 (LWP 31863)]
[New Thread 0x7ffff118f000 (LWP 31864)]
[New Thread 0x7fffe5fff000 (LWP 31865)]
29		cuda_buffer_view<task_package>{};
(cuda-gdb)

In the meantime, you can try using the cuda-gdb from the CUDA Toolkit 11.8 release or use the workaround you described.

LukasLang · February 21, 2023, 8:17am

Thank you very much for the prompt reply! I am looking forward to a version with a fix.

Thanks to you confirming that the issue is indeed with self-referential types, I have been able to implement a workaround. In case anyone else stumbles across this issue, here’s what I did: I added a new class template opaque<T> that tries to behave like T, but stores the instance of T in a char array:

#pragma once

#include <new>
#include <type_traits>

template <typename T>
struct opaque
{
	using store_t = std::conditional_t<std::is_reference_v<T>, std::reference_wrapper<std::remove_reference_t<T>>, T>;

	char buf[sizeof(store_t)];

	template <typename... A>
	opaque(A &&...args)
	{
		new (buf) store_t{std::forward<A>(args)...};
	}
	opaque &operator=(const T &val)
	{
		*(store_t *)buf = val;
		return *this;
	}
	template <typename T2 = T>
		requires(!std::is_reference_v<T2>)
	opaque &operator=(T &&val)
	{
		*(store_t *)buf = std::move(val);
		return *this;
	}

	~opaque()
	{
		((store_t *)buf)->~store_t();
	}

	operator T()
	{
		return *(store_t *)buf;
	}

	template <typename T2 = T, std::enable_if_t<!std::is_reference_v<T2>, bool> = true>
	store_t *operator->()
	{
		return (store_t *)buf;
	}

	T operator*()
	{
		return *(store_t *)buf;
	}
};

The task_package from my orginal post would then be

struct task_package
{
	opaque<cuda_buffer_view<task_package>> owner{};
};

This way, the type is no longer self-referential in a way that crashes cuda-gdb, while still behaving mostly like the original type. The only changes required are that member variables need to acessed as foo->bar instead of foo.bar, and conversion of opaque<T> into T sometimes needs to be forced with *foo. - It’s not a perfect fix, but at least it allows me to use cuda-gdb again until the fix is released

ndevenish · February 21, 2023, 7:30pm

Thank you! I have also been struggling with this issue for the past week, and only just got around to reducing. It was indeed with a self-referential structure with a pointer-to-function and a function argument pointing to a struct that holds it.

Using cuda-gdb from 11.8 works, and using a void * instead of the self-reference pointer in the (below) HitFunction interface also works.

I include my reduced example here for completeness. The two-step reassignment through the union appears to be necessary for cuda-gdb to try to read the type information, and the SEGV happens on the synchronise, although I believe any runtime API call will do (it was originally a malloc) . Compile with: nvcc -G -o debug debug.cu && cuda-gdb ./debug -ex "b hit_sphere" -ex "r":

#include <cstdio>

union RayObjects;

using HitFunction = void (*)(const RayObjects *hittable);

struct RayObject_Sphere {
    HitFunction hit;
};

union RayObjects {
    RayObject_Sphere sphere;
};

__device__ void hit_sphere(const RayObjects *hittable) {
    const RayObject_Sphere *obj = &hittable->sphere;
}

int main(void) {
    printf("This line prints\n");
    cudaDeviceSynchronize();
    printf("This line doesn't\n");
}

system · March 7, 2023, 7:30pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Cuda-gdb aborted CUDA-GDB	7	144	November 24, 2024
Cuda-gdb doesn't break and/or step into Kernels CUDA Programming and Performance	26	53754	August 1, 2011
Cuda-gdb crash when trying to debug kernel launched through `cudaLaunchCooperativeKernel` CUDA-GDB cuda-gdb	11	2407	April 29, 2024
When using cuda-gdb for debugging in CLion, setting a breakpoint within a kernel function causes CLion to freeze CUDA-GDB	2	365	July 30, 2024
Kernel segfault in the depths of libcuda.so Linux / C++ CUDA Programming and Performance	3	7260	September 12, 2011
Cuda-gdb internal-error of copy_type on basic fortran example CUDA-GDB cuda-gdb	5	978	January 2, 2024
Segmentation Fault in kernel.cu CUDA Programming and Performance	6	5156	July 13, 2011
Segfault using cuda-gdb 12 with cusparseCreate() in a thread GPU-Accelerated Libraries cusparse	2	282	March 5, 2024
weird segfault with cuda-gdb CUDA Programming and Performance	0	6945	January 23, 2010
CUDA-GDB captured "Illegal access to address" exception when I invoke child kernel, but the result is correct when free run CUDA Programming and Performance	6	1720	March 20, 2017

Cuda-gdb segmentation fault when stepping through program

Related topics