Debug cuda kernel code compiled by llvm/clang


Because nvcc lacks certain modern c++ features which my project need, I have to use clang to compile my CUDA code. But I can’t seem to step into the kernel code in cuda-gdb. Here’s a minimal example. I’m compiling this

#include <iostream>

__global__ void axpy(float a, float* x, float* y) {
  y[threadIdx.x] = a * x[threadIdx.x];

int main(int argc, char* argv[]) {
  const int kDataLen = 4;

  float a = 2.0f;
  float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
  float host_y[kDataLen];

  // Copy input data to device.
  float* device_x;
  float* device_y;
  cudaMalloc(&device_x, kDataLen * sizeof(float));
  cudaMalloc(&device_y, kDataLen * sizeof(float));
  cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),

  // Launch the kernel.
  axpy<<<1, kDataLen>>>(a, device_x, device_y);

  // Copy output data to host.
  cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),

  // Print the results.
  for (int i = 0; i < kDataLen; ++i) {
    std::cout << "y[" << i << "] = " << host_y[i] << "\n";

  return 0;


export CUDAPATH=$(dirname $(which nvcc))/../
export CUDALIB=$CUDAPATH/lib64
clang++ -o axpy --cuda-gpu-arch=sm_61 \
    -L$CUDALIB \
    -lcudart_static -ldl -lrt -pthread \

Trying cuda-gdb ./axpy and break at gives:

Thread 1 "axpy" hit Breakpoint 1, 0x000000000094b138 in axpy<<<(1,1,1),(4,1,1)>>> ()
(cuda-gdb) thread 1
[Switching to thread 1 (Thread 0x2aaaabf75c00 (LWP 18082))]
#0  0x00002aaaac1780b8 in cuEGLApiInit () from /usr/lib64/nvidia/
(cuda-gdb) p threadIdx
Cannot access memory at address 0xfffffffffffffff4
(cuda-gdb) where
#0  0x00002aaaac1780b8 in cuEGLApiInit () from /usr/lib64/nvidia/
#1  0x00002aaaac2898bb in cuVDPAUCtxCreate () from /usr/lib64/nvidia/
#2  0x00002aaaac2899f7 in cuVDPAUCtxCreate () from /usr/lib64/nvidia/
#3  0x00002aaaac20c96d in cuVDPAUCtxCreate () from /usr/lib64/nvidia/
#4  0x00002aaaac20cb64 in cuVDPAUCtxCreate () from /usr/lib64/nvidia/
#5  0x00002aaaac12a14c in cudbgApiDetach () from /usr/lib64/nvidia/
#6  0x00002aaaac12a2c1 in cudbgApiDetach () from /usr/lib64/nvidia/
#7  0x00002aaaac06652a in ?? () from /usr/lib64/nvidia/
#8  0x00002aaaac1a22ed in cuCtxSynchronize () from /usr/lib64/nvidia/
#9  0x000000000041401d in cudart::cudaApiDeviceSynchronize() ()
#10 0x00000000004358ed in cudaDeviceSynchronize ()
#11 0x0000000000403c87 in main (argc=1, argv=0x7fffffffc108) at

and I can not print out any variable.

I googled this and found some old topics talking about llvm can’t generate debug info in the binary. Is that still the case now? Can anyone suggest a workaround?

What exactly are the “modern c++ features” that you need, which I don’t really see in your sample code?
Maybe you can try PGI Community, and also check its forums:

This specific question about Clang/LLVM will probably be better answered by its maintainers/forums too.

Hi, it’s c++14 and a little bit of c++17. I can work around the c++17 features but nvcc always has trouble dealing with some edge cases in c++14. For example, compile-time variable template arithmetic doesn’t work in nvcc (I submitted a bug report for this).

Thanks for the suggestion. I checked out PGI compiler and it doesn’t look like a single-compiler solution to me. I seems it still relies on nvcc for device code compilation. Am I right?

Using clang to compile device code is not something that is supported by NVIDIA. There’s nothing wrong with it; it’s a great example of the richness of the CUDA ecosystem, it just doesn’t happen to be a NVIDIA product (just like gnu OpenACC support is not a NVIDIA product, but we’re happy to see it.)

clang, PGI, and gnu toolchains are all part of the “supported” CUDA C/C++ offering (on linux), when used as the host compiler. This is covered in the linux install guide. In this respect, they all effectively use nvcc (or you could say nvcc uses them) to provide a complete compilation solution for host/device CUDA code.

Glad to hear that you filed a bug for the concerns you had around “nvcc” device code compilation C++14 support.

PGI also offers a product around the OpenACC standard. This product can create a binary that will execute on a CUDA-capable GPU. Under the hood it can either use something approximating a full nvcc device-code compilation, but the “default” path is that it generates PTX code under the hood, and uses the ptxas PTX-to-device-code compiler, to generate CUDA-compatible binaries.

of course you are welcome to post questions about clang targeting CUDA devices on these forums. I’ll also point that the llvm/clang docs themselves suggest their own forums for support or questions not answered in the docs:

Studying this:

It seems that clang/CUDA should support a compile option that looks like:


it may be worth a try

Thank you Robert for your explanation! It looks like llvm is a no go for debugging ( As for PGI, it doesn’t compile native CUDA code. What I need is the debug info for the CUDA device code with the c++ features I want. I guess I’m stuck with nvcc for that purpose.