cuda-gdb claims to fully support CUDA dynamic parallelism.
You might want to read the manual, and note all references to “dynamic”:
https://docs.nvidia.com/cuda/cuda-gdb/index.html
When I perform a fairly simple test, I can set breakpoints, single-step, and inspect variables in a child kernel, so I’m not sure what the problem is.
It is true that you cannot step from a parent kernel directly into a child kernel launch. But you can set breakpoints in the child kernel, and begin stepping and inspecting variables there.
Here’s a simple fully worked example:
$ cat t291.cu
#include <stdio.h>
__global__ void child(int *data){
int val = *data;
val *= 2;
*data = val;
}
__global__ void parent(int *data){
child<<<1,1>>>(data);
cudaDeviceSynchronize();
}
int main(){
int *d_data, h_data = 2;
cudaMalloc(&d_data, sizeof(int));
cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
parent<<<1,1>>>(d_data);
cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
}
$ nvcc -arch=sm_35 -o t291 t291.cu -g -G -rdc=true -lcudadevrt
$ cuda-gdb ./t291
NVIDIA (R) CUDA Debugger
9.2 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./t291...done.
(cuda-gdb) break t291.cu:5
Breakpoint 1 at 0x4033d1: file t291.cu, line 5.
(cuda-gdb) run
Starting program: /home/user2/misc/t291
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffefa08700 (LWP 6418)]
[New Thread 0x7fffef207700 (LWP 6419)]
[Switching focus to CUDA kernel 1, grid -5, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 1, lane 0]
Thread 1 "t291" hit Breakpoint 1, child<<<(1,1,1),(1,1,1)>>> (data=0x7fffbb600000) at t291.cu:5
5 int val = *data;
(cuda-gdb) print val
$1 = <optimized out>
(cuda-gdb) step
6 val *= 2;
(cuda-gdb) print val
$2 = 2
(cuda-gdb) step
7 *data = val;
(cuda-gdb) print val
$3 = 4
(cuda-gdb) continue
Continuing.
[Thread 0x7fffef207700 (LWP 6419) exited]
[Thread 0x7ffff7fcc740 (LWP 6393) exited]
[Inferior 1 (process 6393) exited normally]
(cuda-gdb) quit
$
Note the negative ID on the grid:
[Switching focus to CUDA kernel 1, grid -5, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 1, lane 0]
^^^^^^^
indicates the kernel where the breakpoint was hit is from a device-side kernel launch.