Cuda-gdb doesn't break and/or step into Kernels

Here’s what I do:

[codebox]$ nvcc -g -G main.cu -o main

$ cuda-gdb main

(cuda-gdb) break KernelName

Breakpoint 1 at 0x805725a: file regressionKernel.cu, line 17.

(cuda-gdb) run

Starting program: /home/letharion/cuda/a.out

[Thread debugging using libthread_db enabled]

[New Thread -1211300144 (LWP 12837)]

Program exited normally.

[/codebox]

The compilation was done as instructed on page 5 of the cuda-gdb manual.

When I set the breakpoint, my interpretation of the response is that my kernel has been found: line 17 is the first line of code inside “KernelName”.

When I then run the program, I would expect it to break inside the kernel, but as you can see, that doesn’t happen.

I can also break at main and “next” my way to the kernel invocation. From there I try to step inside, but fail: I’m only shown two lines of dim3() and then execution proceeds past the kernel.

What am I doing wrong?

Do you have a GPU with debugging capabilities? I think that if you don’t, you get a message that says something like “GPU with no debugging capabilities,” but I am just checking.

“Breakpoint 1 at 0x805725a: file regressionKernel.cu, line 17” implies that you’re compiling things at different times. This doesn’t work: all of your CUDA files need to be compiled together in a single step in order for the debugger to work.
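For example (a sketch only; the file names are taken from the messages above and may not match your project), that means one nvcc invocation listing every .cu file, rather than compiling and linking them separately:

[codebox]$ nvcc -g -G main.cu regressionKernel.cu -o main[/codebox]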

I’m having the same problems. Any advice?

I do get the warning below, which I override; X has been stopped via “/etc/init.d/gdm stop” and I’m connecting via an NX remote screen.

ERROR: Debugging a GPU attached to a Window system is not supported and may hang the system.

We have detected that DISPLAY is set which indicates a window system is running.

To override this detection, set the environment variable CUDA_GDB_DISPLAY_OVERRIDE to any value.
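The override itself is just what the message suggests, i.e. setting the variable (any value) before starting the debugger:

[codebox]$ export CUDA_GDB_DISPLAY_OVERRIDE=1
$ cuda-gdb Main.out[/codebox]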

Ubuntu, 2.6.24-23-generic.

nvcc: release 2.1, V0.2.1221

Card: GeForce 9400 GT

Compile line:

nvcc -g -G Main.cu -I /opt/NVIDIA_CUDA_SDK/common/inc/ -o Main.out

cuda-gdb session:

cuda-gdb -q Main.out 

Using host libthread_db library "/lib/tls/i686/cmov/libthread_db.so.1".

(cuda-gdb) break incrementArrayOnDevice

Breakpoint 1 at 0x805c80e: file Main.cu, line 15.

(cuda-gdb) run

Starting program: Main.out 

[Thread debugging using libthread_db enabled]

[New process 12286]

[New Thread -1211861312 (LWP 12286)]

Program exited normally.

(cuda-gdb)

Code ref: http://www.ddj.com/hpc-high-performance-computing/207402986

[codebox]#include <stdio.h>
#include <assert.h>
#include <cuda.h>

void incrementArrayOnHost(float *a, int N)
{
  int i;
  for (i=0; i < N; i++) a[i] = a[i]+1.f;
}

__global__ void incrementArrayOnDevice(float *a, int N)
{
  int idx = blockIdx.x*blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx]+1.f;
}

int main(void)
{
  float *a_h, *b_h;   // pointers to host memory
  float *a_d;         // pointer to device memory
  int i, N = 10;
  size_t size = N*sizeof(float);
  // allocate arrays on host
  a_h = (float *)malloc(size);
  b_h = (float *)malloc(size);
  // allocate array on device
  cudaMalloc((void **) &a_d, size);
  // initialization of host data
  for (i=0; i<N; i++) a_h[i] = (float)i;
  // copy data from host to device
  cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);
  // do calculation on host
  incrementArrayOnHost(a_h, N);
  // do calculation on device:
  // Part 1 of 2. Compute execution configuration
  int blockSize = 4;
  int nBlocks = N/blockSize + (N%blockSize == 0?0:1);
  // Part 2 of 2. Call incrementArrayOnDevice kernel
  incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);
  // Retrieve result from device and store in b_h
  cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // check results
  for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);
  // cleanup
  free(a_h); free(b_h); cudaFree(a_d);
}[/codebox]
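(For anyone reproducing this: “Program exited normally” only confirms that the host code finished, not that the kernel ever launched. Below is a sketch of error checks one could add right after the launch to rule that out; it is not part of the original article code, and cudaThreadSynchronize is the pre-CUDA-4.0 name for the device-wide sync.)

[codebox]  incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

  // Hypothetical addition: surface launch/execution errors instead of failing silently
  cudaError_t err = cudaGetLastError();        // error from the launch itself
  if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
  err = cudaThreadSynchronize();               // wait for the kernel; catches execution errors
  if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));[/codebox]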

After much trial and error, it turned out to be the ‘if’ statement in the kernel. I remember reading something about divergence/convergence of conditionals, but does that mean you can’t debug a kernel (on the device, not in emulation) that contains conditionals?

Try reducing thread block-size!

Just wondering, was this issue ever resolved properly?

I had this same problem with nvcc from “Cuda compilation tools, release 3.1, V0.2.1221” (and a Fermi card), though I never had it with nvcc from v2.3 (and a Tesla card). This is on Red Hat Linux. No matter how I built the executable or set the breakpoints, cuda-gdb just would not step into a kernel, but jumped right over it (as if I had typed “next” instead of “step”, and as if no breakpoints were set in there). cuda-gdb was happy enough to accept breakpoints based on the kernel name and on line numbers inside the kernel. There were no “if” statements in the kernel either. It didn’t make much sense anyway that an “if” statement would throw off a debugger so completely.

I solved the problem (for my case at least) by reducing the block size from 1024 (32x32) to 256 (16x16). That would explain “everything”, except that 1024 threads per block is supposed to be allowed…
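If block size is the suspect, the actual per-device limit can be queried rather than guessed (a minimal sketch using the standard runtime API; device 0 is assumed):

[codebox]#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // properties of device 0
  // e.g. 512 threads per block on pre-Fermi parts, 1024 on Fermi
  printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
  printf("maxThreadsDim: %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  return 0;
}[/codebox]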

I’m suffering from the same problem, both with my own code and with simplified kernels. I’ve tried the walkthrough example from the cuda-gdb manual, but debugging kernels just does not work.

When I try to replicate the steps from the manual (page 20), at step 4, breaking in the kernel gives me this output:

(cuda-gdb) c

Continuing.

Breakpoint 2, bitreverse (__cuda_0=0xfc00000000) at bitreverse.cu:8

8	__global__ void bitreverse(void *data) {

which obviously shows that something is not right. At the same time, if I try any of the CUDA-specific gdb info features I get nothing but this message:

(cuda-gdb) info cuda threads 

Focus not set on any running CUDA kernel.

I desperately need debugging, so could somebody explain what’s going on?

GPU: GTX 470

OS: Ubuntu 10.04.1 LTS X86_64, 2.6.32-25-generic

Driver: 260.19.21

CUDA: v3.2 release

Hello,

I’m having the exact same experience. Have you had any further luck with this?

GPUs: GT 430 and GTX 480

OS: Ubuntu 10.04.1 LTS X86_64, 2.6.32-27-generic

Driver: 260.19.26

CUDA: v3.2 release

Cheers,

Christian

Make sure your app is compiled with -g -G options.

When the breakpoint is hit, the cuda-gdb focus (active stack) can be on the host thread. In that case you need to switch to the device (GPU) by first viewing the list of all kernels and then explicitly switching to the device, as follows. After that you can do info cuda threads to see all the device threads.

Switch to CUDA kernel:
[codebox](cuda-gdb) info cuda kernels
* 0 Device 0 bitreverse <<<(1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:9
(cuda-gdb) cuda kernel 0
[Switching to CUDA Kernel 0 (<<<(0,0),(0,0,0)>>>)]
#0 bitreverse <<<(1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:9
9       unsigned int idata = (unsigned int)data;
(cuda-gdb) bt
#0 bitreverse <<<(1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:9[/codebox]

Also, please review the latest 3.2 cuda-gdb manual in the CUDA Toolkit Documentation.

I’ve been having the same problem. I’ve tried it on Linux with the X server switched off, using a GTX-480. cuda-gdb seems to detect the CUDA device. I can set breakpoints at the beginning of the kernel, but then it continues right till the end of the kernel and continues to the host code. It is not possible to switch to kernel code at all using the gdb commands.

However, stepping into the code did seem to work once, perhaps again deceptively, at which point it delved into unfathomable depths, as I’m using a lot of library calls.

What I recall is that I set a breakpoint in the middle of the kernel and absolutely nothing happened.

Could this be a problem with the single GPU setup?

This is the latest 3.2 cuda-gdb that I’m using. Any recommendations for what to do once the breakpoint at the kernel function is reached (i.e. after “break test_kernel” and “run”)?

Regards,

Eray

I can set breakpoints at the beginning of the kernel, but then it continues right till the end of the kernel and continues to the host code.

This is typically a symptom of the app not having debug symbols. Is the app compiled with the nvcc -g -G flags?
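To be clear about the distinction (a sketch; Main.cu is just the file name used earlier in this thread): -g alone only produces host-side debug info, so kernel breakpoints are silently skipped, while -G adds the device-side debug info that cuda-gdb needs (and disables device code optimisation):

[codebox]$ nvcc -g Main.cu -o Main.out      # host symbols only: cuda-gdb steps over kernels
$ nvcc -g -G Main.cu -o Main.out   # host + device symbols: kernel breakpoints can be hit[/codebox]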

Yes, I’ve used those flags. I can try it out again in a bit; I’m using the latest Ubuntu 10.10 distribution, and this is about the only major toolkit bug I’ve seen. Interestingly, similar behavior was seen on OS X while I was running Aqua, which is why I thought maybe it was not detecting the GPU properly, but then the info command did show a single GPU. I could never see any threads with the info commands, though. Let me try this once more on Linux, just to make sure.

OK, I’ve reinstalled everything on another 64-bit Ubuntu 10.10 system, and no, it’s not possible to step into kernels. Although the CUDA program runs correctly, cuda-gdb cannot detect any kernels or CUDA threads or set focus on anything. The only command that works is info cuda system, which correctly reports my GF100. I’ve compiled with -arch=sm_20 -g -G.

That is to say issuing

cuda kernel 0

doesn’t work.

Bump!

I still can’t debug at all, and now that I have to track down some fairly tricky bugs, printf-ing has proved to consume lots of time and help very little. Has anyone figured out a workaround or fix?

I suppose for a start you should try the new CUDA 4.0 release. I will in due time.

Right, that was my hope as well; I forgot to mention that I tried it just today, and it does not work. It seems to happily step through the CPU code generated by nvcc, but it does not step into the kernel. Really strange…

I am suffering from the same problem too. Have you filed a bug report about it?

What is your platform and exact hardware? There may be troublesome configurations that the developers have not had the chance to test. I was last using the OS X and Linux cuda-gdb from CUDA 3.2 on a GTX-480, with no success. Exactly the same problem: you can’t step into the GPU code, can’t trace, and only the CPU threads work, although the debugger seems to fire up all right.