Debugging device code does not work


I’m programming a shared library (with several source files) containing several cuda kernels. I’d like to debug the device code, but it’s not working and I can’t figure out why…

I’m compiling the source files that contain CUDA code with the -g -G options. After that, I create a shared library from all the object files and move it to a different folder (I use the “directory” command in gdb to make sure that gdb knows where the source files are).
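
For reference, a build along those lines might look roughly like the sketch below. The file names (kernels.cu, host.cpp, libmykernels.so) and the target folder are placeholders, not the actual project layout, and -Xcompiler -fPIC is an assumption based on the objects ending up in a shared library.

```shell
# Placeholder file names and paths; adjust to the real project.
# -g adds host debug info, -G adds device debug info.
nvcc -g -G -Xcompiler -fPIC -c kernels.cu -o kernels.o
g++ -g -fPIC -c host.cpp -o host.o

# Link all object files into one shared library.
nvcc -shared -o libmykernels.so kernels.o host.o

# The library may live elsewhere afterwards; gdb is then told where the
# sources are from inside the debugger:
#   (cuda-gdb) directory /path/to/sources
mv libmykernels.so /some/other/folder/
```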

I’m able to debug host code, but gdb does not stop in kernel code.
Does anyone know what I’m doing wrong? I tried googling, but I didn’t find a solution and I’m getting more and more frustrated…

Thanks in advance for any help!

Some additional information:

  • I have two graphics cards; I'm using one of them for my displays and the other solely for CUDA (I'm able to debug simple programs)
  • I'm using CUDA 5.0 on Ubuntu 12.04
  • My kernel uses some inline functions

When I try to step through a kernel, I get the following output from cuda-gdb:
first step:

namesp::__wrapper__device_stub_myKernel<namesp::someclass> (__cuda_0=@0x7fffa0fbb018, 
    __cuda_1=@0x7fffa0fbb010, __cuda_2=@0x7fffa0fbb008, __cuda_3=@0x7fffa0fbb004, __cuda_4=@0x7fffa0fbb000, 
    __cuda_5=@0x7fffa0fbaffc, __cuda_6=@0x7fffa0fbb040, __cuda_7=@0x7fffa0fbb048, __cuda_8=@0x7fffa0fbb050, 
    __cuda_9=@0x7fffa0fbaff8, __cuda_10=@0x7fffa0fbb058, __cuda_11=@0x7fffa0fbaff4, __cuda_12=@0x7fffa0fbaff0, 
    __cuda_13=@0x7fffa0fbafec, __cuda_14=@0x7fffa0fbafe8, __cuda_15=@0x7fffa0fbb060, __cuda_16=@0x7fffa0fbb068)
    at mySource.cudafe1.stub.c:662
662	template<> __specialization_static void __wrapper__device_stub_myKernel< ::namesp::someclass>(  _ZN10namesp12someclassE *&__cuda_0, _ZN10namesp14someclassMSE *&__cuda_1,uint32_t *&__cuda_2,const uint32_t &__cuda_3,const uint32_t &__cuda_4,const uint32_t &__cuda_5,const uint32_t &__cuda_6,unsigned *&__cuda_7,const int &__cuda_8,const float &__cuda_9,const int &__cuda_10,const float &__cuda_11,const float &__cuda_12,const float &__cuda_13,const float &__cuda_14,float *&__cuda_15,const uint32_t &__cuda_16){__device_stub__ZN10namesp19myKernelINS_12someclassEEEvPT_PNS_1someclassMSEPjjjjjS6_ififfffPfj( __cuda_0,__cuda_1,__cuda_2,__cuda_3,__cuda_4,__cuda_5,__cuda_6,__cuda_7,__cuda_8,__cuda_9,__cuda_10,__cuda_11,__cuda_12,__cuda_13,__cuda_14,__cuda_15,__cuda_16);}}

second step:

__device_stub__ZN10namesp19myKernelINS_12someclassEEEvPT_PNS_14someclassMSEPjjjjjS6_ififfffPfj (
    __par0=0x5008c0000, __par1=0x5007f0000, __par2=0x5006c0000, __par3=3, __par4=10, __par5=512, __par6=7680, 
    __par7=0x5006c3000, __par8=73, __par9=0.300000012, __par10=1392, __par11=1319.91235, __par12=1323.49597, 
    __par13=714.734375, __par14=487.081604, __par15=0x501720000, __par16=128) at mySource.cudafe1.stub.c:660
660	static void __device_stub__ZN10namesp19myKernelINS_12someclassEEEvPT_PNS_14someclassMSEPjjjjjS6_ififfffPfj( _ZN10namesp12someclassE *__par0,  _ZN10namesp14someclassMSE *__par1, uint32_t *__par2, const uint32_t __par3, const uint32_t __par4, const uint32_t __par5, const uint32_t __par6, unsigned *__par7, const int __par8, const float __par9, const int __par10, const float __par11, const float __par12, const float __par13, const float __par14, float *__par15, const uint32_t __par16){__cudaSetupArgSimple(__par0, 0UL);__cudaSetupArgSimple(__par1, 8UL);__cudaSetupArgSimple(__par2, 16UL);__cudaSetupArgSimple(__par3, 24UL);__cudaSetupArgSimple(__par4, 28UL);__cudaSetupArgSimple(__par5, 32UL);__cudaSetupArgSimple(__par6, 36UL);__cudaSetupArgSimple(__par7, 40UL);__cudaSetupArgSimple(__par8, 48UL);__cudaSetupArgSimple(__par9, 52UL);__cudaSetupArgSimple(__par10, 56UL);__cudaSetupArgSimple(__par11, 60UL);__cudaSetupArgSimple(__par12, 64UL);__cudaSetupArgSimple(__par13, 68UL);__cudaSetupArgSimple(__par14, 72UL);__cudaSetupArgSimple(__par15, 80UL);__cudaSetupArgSimple(__par16, 88UL);__cudaLaunch(((char *)((void ( *)( _ZN10namesp12someclassE *,  _ZN10namesp14someclassMSE *, uint32_t *, const uint32_t, const uint32_t, const uint32_t, const uint32_t, unsigned *, const int, const float, const int, const float, const float, const float, const float, float *, const uint32_t))namesp::myKernel<namesp::someclass> )));}namespace namesp{

When I step again, I’m back in my host code. During these steps, the “cuda kernel” command in cuda-gdb reports: “Focus not set on any active CUDA kernel.”

As you can see, I’m using C++ templates, but I already tried the same without templates and debugging didn’t work either.

I also tried reducing the block-size, but I still could not debug.

Any suggestions?

Haven’t used cuda-gdb, so pardon my ignorance. Perhaps the problem resides in the fact that the code is in a shared library. If you can, try and see if you can make some of it as a standalone executable and see if you can debug it that way. Before you even do that, try cuda-gdb with a simple ‘hello world’ type example and see if that works and go from there. Hope it helps. :)

Thanks for your comment!
I forgot to mention that I successfully debugged a kernel in a shared library already (I wrote a small test program and a shared library for this purpose), so debugging a shared library shouldn’t be a general problem.
I’m loading the symbols with the “sharedlibrary”-command and in my test program/shared library this works fine.

I think I figured out what the problem is:

I’m using “set auto-solib-add off” in my .cuda-gdbinit file.

cuda-gdb can’t debug kernels with auto-solib-add off on start-up of the executable.
When I set auto-solib-add on after start-up, debugging of kernels still does not work, even when loading all symbols of all shared libraries.
In this case (when loading the shared symbols of all shared libraries) I get the following error when continuing:

The CUDA driver has hit an internal error.
Error code: 0x19007a00000001c
Further execution or debugging is unreliable.

I tried “set stop-on-solib-events 1” to make sure there is no shared library which is loaded on start-up and unloaded directly afterwards.

cuda-gdb obviously does not work with auto-solib-add off, which is pretty annoying in my case: I don’t want cuda-gdb to load the symbols of ALL shared libraries, because with ~100 shared libraries that slows down the execution of my application dramatically.

I hope NVIDIA will fix this problem in the near future.


Thanks for posting the issue.

To be able to use ‘set auto-solib-add off’ with currently released versions of cuda-gdb, you can do the following:

(1) Compile your application to explicitly link against libcuda. Example: nvcc -g -G -o myapp -lcuda
(2) Set a breakpoint at main (‘break main’), then type ‘run’.
(3) Type ‘sharedlibrary libcuda’
(4) Proceed with the debug session. You should now be able to break into kernels and use the debugger as normal.
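
Steps (2) and (3) can also be kept in a command file so they don't have to be retyped each session. This is only a sketch: the command file name workaround.gdb and the executable ./myapp are placeholders, and it assumes cuda-gdb accepts gdb's standard -x option for reading commands from a file.

```shell
# Placeholder names: workaround.gdb and ./myapp.
cat > workaround.gdb <<'EOF'
# Keep automatic symbol loading off (same setting as in .cuda-gdbinit).
set auto-solib-add off
# Stop at main so that libcuda is already mapped into the process.
break main
run
# Load symbols for libcuda only; cuda-gdb needs these to debug kernels.
sharedlibrary libcuda
EOF

# After the file is processed, cuda-gdb stays at its prompt, stopped in main.
cuda-gdb -x workaround.gdb ./myapp
```

From the prompt you can then load symbols for your own library with another “sharedlibrary” command, set kernel breakpoints, and continue.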

We will look into an automatic solution (to avoid the manual workaround above) for a future release of cuda-gdb.

Thanks a lot for your reply, geoffg.

I tried your workaround and it works perfectly fine! Thanks for this very useful hint.
An automatic solution would be better, but the workaround is much better than nothing!

Another feature that would be great to have is Python support for cuda-gdb.