cuda-gdb - dynamic parallelism support

How can cuda-gdb support dynamic parallelism?
My problem is that I am trying to track down a cuda-memcheck error in CUDA code that uses dynamic parallelism. The error happens in a second kernel that is launched from inside a first kernel, and neither cuda-memcheck nor cuda-gdb gives me useful information about it: they simply report the line in the first kernel where the second kernel is launched as the error location, which is not accurate.

Is there any way to get some more accurate information?

I have the same question.

cuda-gdb claims to fully support CUDA dynamic parallelism.

You might want to read the manual, and note all references to “dynamic”:

https://docs.nvidia.com/cuda/cuda-gdb/index.html

When I perform a fairly simple test, I can set breakpoints, single-step, and inspect variables in a child kernel, so I’m not sure what the problem is.

It is true that you cannot step from a parent kernel directly into a child kernel launch. But you can set breakpoints in the child kernel, and begin stepping and inspecting variables there.

Here’s a simple, fully worked example:

$ cat t291.cu
#include <stdio.h>

__global__ void child(int *data){

  int val = *data;
  val *= 2;
  *data = val;
}

__global__ void parent(int *data){

  child<<<1,1>>>(data);     // device-side (dynamic parallelism) launch
  cudaDeviceSynchronize();  // wait for the child kernel to complete
}

int main(){

  int *d_data, h_data = 2;
  cudaMalloc(&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  parent<<<1,1>>>(d_data);  // launch the parent kernel from the host
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
}
$ nvcc -arch=sm_35 -o t291 t291.cu -g -G -rdc=true -lcudadevrt
$ cuda-gdb ./t291
NVIDIA (R) CUDA Debugger
9.2 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./t291...done.
(cuda-gdb) break t291.cu:5
Breakpoint 1 at 0x4033d1: file t291.cu, line 5.
(cuda-gdb) run
Starting program: /home/user2/misc/t291
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffefa08700 (LWP 6418)]
[New Thread 0x7fffef207700 (LWP 6419)]
[Switching focus to CUDA kernel 1, grid -5, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 1, lane 0]

Thread 1 "t291" hit Breakpoint 1, child<<<(1,1,1),(1,1,1)>>> (data=0x7fffbb600000) at t291.cu:5
5         int val = *data;
(cuda-gdb) print val
$1 = <optimized out>
(cuda-gdb) step
6         val *= 2;
(cuda-gdb) print val
$2 = 2
(cuda-gdb) step
7         *data = val;
(cuda-gdb) print val
$3 = 4
(cuda-gdb) continue
Continuing.
[Thread 0x7fffef207700 (LWP 6419) exited]
[Thread 0x7ffff7fcc740 (LWP 6393) exited]
[Inferior 1 (process 6393) exited normally]
(cuda-gdb) quit
$

Note the negative ID on the grid:

[Switching focus to CUDA kernel 1, grid -5, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 1, lane 0]
                                   ^^^^^^^

The negative grid ID indicates that the kernel in which the breakpoint was hit was launched from device code, i.e. it is a child kernel.
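
As an aside (this is not part of the session above): if you want the parent/child relationship shown explicitly, rather than inferred from the sign of the grid ID, cuda-gdb also has an info cuda kernels command, which lists the kernels currently resident on the device along with their parent grid. The exact column layout varies by cuda-gdb version:

(cuda-gdb) info cuda kernels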

Hey txbob, thanks for the reply. Apparently the problem is not in cuda-gdb itself: my child kernel does not launch when the code is compiled with --device-debug (a printf in the child works only when compiled without --device-debug). I checked the PTX, and the function call appears to be there. Does something like the maximum block size for child kernels somehow change with --device-debug?

No, but any number of other code characteristics can change, such as register usage, which could effectively limit the maximum block size a kernel can be launched with.
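
One way to check (a minimal sketch I'm adding for illustration; the child kernel here is a stand-in, not your actual code) is to compare the two builds with cudaFuncGetAttributes(), which reports the register count and the largest block the compiled kernel can be launched with. Passing -Xptxas -v to nvcc prints similar information at compile time.

#include <cstdio>

__global__ void child(int *data){ *data *= 2; }

int main(){

  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, child);  // query properties of the compiled kernel
  printf("registers per thread: %d, max threads per block: %d\n",
         attr.numRegs, attr.maxThreadsPerBlock);
}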

It sounds like your issue has nothing at all to do with cuda-gdb or what it is capable of.

If you use rigorous CUDA error checking in both your host code and your device code (you use it in device code exactly the same way you would in host code), it should spell out why your child kernels are not launching.
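
For example, here is a minimal sketch of device-side error checking in the parent kernel from my example above (my illustration, not code from your post). The first check catches a failed launch (e.g. a block size that is no longer achievable under --device-debug), and the second catches errors that occur while the child executes. Note that device-side cudaDeviceSynchronize() matches the CUDA 9.2 vintage of this thread; it was removed from device code in CUDA 12.

#include <cstdio>

__global__ void child(int *data){ *data *= 2; }

__global__ void parent(int *data){

  child<<<1,1>>>(data);
  cudaError_t err = cudaGetLastError();  // did the device-side launch fail?
  if (err != cudaSuccess)
    printf("child launch failed: %s\n", cudaGetErrorString(err));
  err = cudaDeviceSynchronize();         // did the child hit an error while running?
  if (err != cudaSuccess)
    printf("child execution failed: %s\n", cudaGetErrorString(err));
}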