Cannot step into kernel in vectorAdd sample

Hello, I’m trying to run and debug the “vectorAdd” sample, but both Nsight Eclipse Edition and cuda-gdb refuse to step into the kernel.
cedric@IslaNegra:~/NVIDIA_CUDA-5.0_Samples/0_Simple/vectorAdd$ nvcc -g -G -keep vectorAdd.cu -o vectorAdd
–> works fine
cedric@IslaNegra:~/NVIDIA_CUDA-5.0_Samples/0_Simple/vectorAdd$ cuda-gdb ./vectorAdd
NVIDIA ® CUDA Debugger
5.0 release
Portions Copyright © 2007-2012 NVIDIA Corporation
GNU gdb (GDB) 7.2
Copyright © 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type “show copying”
and “show warranty” for details.
This GDB was configured as “x86_64-unknown-linux-gnu”.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>…
Reading symbols from /home/cedric/NVIDIA_CUDA-5.0_Samples/0_Simple/vectorAdd/vectorAdd…done.
(cuda-gdb)
(cuda-gdb) break main
Breakpoint 1 at 0x400c4d: file vectorAdd.cu, line 49.
(cuda-gdb) break vectorAdd
Breakpoint 2 at 0x401316: file vectorAdd.cu, line 33.
(cuda-gdb) run
Starting program: /home/cedric/NVIDIA_CUDA-5.0_Samples/0_Simple/vectorAdd/vectorAdd
[Thread debugging using libthread_db enabled]

Breakpoint 1, main () at vectorAdd.cu:49
49 cudaError_t err = cudaSuccess;
(cuda-gdb) continue
Continuing.
[Vector addition of 50000 elements]
[New Thread 0x7ffff5a85700 (LWP 4031)]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads

Breakpoint 2, vectorAdd (__cuda_0=0x400140000, __cuda_1=0x400170e00,
__cuda_2=0x4001a1c00, __cuda_3=50000) at vectorAdd.cu:33
33 {
(cuda-gdb) step
__device_stub__Z9vectorAddPKfS0_Pfi (__par0=0x400140000, __par1=0x400170e00,
__par2=0x4001a1c00, __par3=50000) at vectorAdd.cudafe1.stub.c:7
7 void __device_stub__Z9vectorAddPKfS0_Pfi(const float *__par0, const float *__par1, float *__par2, int __par3){__cudaSetupArgSimple(__par0, 0UL);__cudaSetupArgSimple(__par1, 8UL);__cudaSetupArgSimple(__par2, 16UL);__cudaSetupArgSimple(__par3, 24UL);__cudaLaunch(((char *)((void ( *)(const float *, const float *, float *, int))vectorAdd)));}
(cuda-gdb)
(cuda-gdb) step
cudaLaunch<char> (
func=0x4012fe "UH\211\345SH\203\354(H\211}\350H\211u\340H\211U?M?M\324H\213U\330H\213]\340H\213E\350H\211\336H\211\307\350\024\377\377\377H\203\304([\311\303UH\211\345SH\203\354\070H\211}\350H\213E\350H\211\005\261- ")
at cuda_runtime.h:1072
1072 return cudaLaunch((const void*)func);
(cuda-gdb) step
vectorAdd (__cuda_0=0x400140000, __cuda_1=0x400170e00, __cuda_2=0x4001a1c00,
__cuda_3=50000) at vectorAdd.cu:40
40 }
(stepped out of kernel)
(cuda-gdb) step
main () at vectorAdd.cu:133
133 err = cudaGetLastError();
(cuda-gdb) step
135 if (err != cudaSuccess)
(cuda-gdb) print err
$1 = cudaSuccess
(So the kernel executed correctly, but the debugger would not step inside it.)

What am I doing wrong?

Can you run the application outside the debugger? Please note that you cannot debug on a GPU that is also used to draw the OS GUI (e.g. you would need two GPUs to debug from within GNOME/KDE/etc.).
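
To check whether the display restriction might apply on your machine, a small device query can help. The following is only a minimal sketch, not part of the vectorAdd sample: it lists the CUDA devices the runtime sees and prints the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`. That flag reports whether the kernel run-time limit (watchdog) is active, which is usually, though not always, a sign that the GPU is driving a display. The helper `debugHint`, its wording, and the file name `listDevices.cu` are made up for illustration.

```cuda
// Sketch: list CUDA devices and flag those that likely drive a display.
// Assumed build command: nvcc -o listDevices listDevices.cu
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the watchdog flag into a readable hint.
// The flag is only a heuristic for "this GPU is attached to a display".
static const char *debugHint(int kernelExecTimeoutEnabled)
{
    return kernelExecTimeoutEnabled
               ? "watchdog on (probably drives a display)"
               : "watchdog off (probably free for debugging)";
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("Found %d CUDA device(s)\n", count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                    dev, cudaGetErrorString(err));
            continue;
        }
        // Print name, compute capability, and the watchdog hint.
        printf("Device %d: %s (compute %d.%d), %s\n",
               dev, prop.name, prop.major, prop.minor,
               debugHint(prop.kernelExecTimeoutEnabled));
    }
    return 0;
}
```

If this lists a single device with the watchdog on, that would be consistent with the behaviour in the session above, where the breakpoint stops in the host-side stub and `step` never enters device code. If a second device shows up with the watchdog off, one option to try is setting the `CUDA_VISIBLE_DEVICES` environment variable to that device's index before starting cuda-gdb, so the application runs on the GPU that is not drawing the desktop.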