Difference in execution while using cuda-gdb vs. direct execution

Hello,
While debugging my CUDA code I came across a situation where the last kernel in my code is called when the program runs directly, but when I run the same code under cuda-gdb or with the CUDA Profiler enabled, that kernel is not shown as being called.
Below is the output from cuda-gdb and from the CUDA Profiler log file:

cuda-gdb:
+++++++++++++++
[Context Create of context 0x165d97f0 on Device 0]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[Launch of CUDA Kernel 0 (prepStruct<<<(1,1,1),(34,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (angV<<<(1,1,1),(72,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (fillUserValues<<<(4,1,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (prepObj_1<<<(60,60,1),(16,16,1)>>>) on Device 0]
[Launch of CUDA Kernel 4 (prepObj_2<<<(60,60,1),(16,16,1)>>>) on Device 0]
[Termination of CUDA Kernel 3 (prepObj_1<<<(60,60,1),(16,16,1)>>>) on Device 0]
couter_SectorArea: nan
couter_SectorAreaRatio: nan

Program exited normally.
(cuda-gdb)

CUDA Profiler output file “cuda_profile_0.log”:
++++++++++++++

CUDA_PROFILE_LOG_VERSION 2.0

CUDA_DEVICE 0 Tesla M2070

TIMESTAMPFACTOR fffff6c88cc14af0

method,gputime,cputime,occupancy
method=[ memset32_aligned1D ] gputime=[ 4.736 ] cputime=[ 22.000 ] occupancy=[ 1.000 ]
method=[ memcpyHtoD ] gputime=[ 2.048 ] cputime=[ 41.000 ]
method=[ memcpyHtoD ] gputime=[ 1.632 ] cputime=[ 30.000 ]
method=[ memcpyHtoD ] gputime=[ 1.632 ] cputime=[ 29.000 ]
method=[ memcpyHtoD ] gputime=[ 429638.562 ] cputime=[ 429958.000 ]
method=[ _Z13prepStructP10cfg_struct ] gputime=[ 4.288 ] cputime=[ 10.000 ] occupancy=[ 0.042 ]
method=[ memcpyDtoH ] gputime=[ 2.304 ] cputime=[ 36.000 ]
method=[ _Z9angVPdP13SecAng_struct ] gputime=[ 2.144 ] cputime=[ 6.000 ] occupancy=[ 0.062 ]
method=[ memcpyDtoH ] gputime=[ 2.272 ] cputime=[ 36.000 ]
method=[ _Z14fillUserValuesP10obj_structPdS1_ ] gputime=[ 3.936 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
method=[ _Z9prepObj_1P10obj_structPdiS1_iP13SecAng_struct ] gputime=[ 4036.096 ] cputime=[ 540.000 ] occupancy=[ 0.500 ]
method=[ _Z14prepObj_2P10obj_structPdiS1_iP13SecAng_struct ] gputime=[ 1733.312 ] cputime=[ 8.000 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 7242.144 ] cputime=[ 13732.000 ]
method=[ memcpyDtoH ] gputime=[ 136024.609 ] cputime=[ 136659.000 ]
method=[ memcpyDtoH ] gputime=[ 268721.844 ] cputime=[ 269382.000 ]
method=[ memcpyDtoH ] gputime=[ 264041.531 ] cputime=[ 264705.000 ]
method=[ memcpyDtoH ] gputime=[ 263697.062 ] cputime=[ 264343.000 ]
method=[ memcpyDtoH ] gputime=[ 261865.031 ] cputime=[ 262520.031 ]

The kernel after prepObj_2 is insideLoop (which is also the last kernel in my code), and I verified that it is called when I run my CUDA code directly (I added printf statements before and after the call to the “insideLoop” kernel).
Both the cuda-gdb output and the CUDA Profiler log end before this last kernel; in fact, cuda-gdb only shows “Termination of CUDA Kernel 3”.

My question:
Is there a difference in the scheduling of multiple kernels when running under cuda-gdb, or with the CUDA Profiler writing “cuda_profile_0.log”, compared to running directly?

System information:
Toolkit 4.0
M2070 on PCIe card on x86 system
CentOS 5.6, 64 bit

Thanks,
Nikhil

I’m not a frequent cuda-gdb user, but I do know that without cuda-gdb or the profiler, kernel launches are non-blocking; with cuda-gdb or the profiler they become blocking. If I have understood that correctly, a blocking launch effectively means a device-wide synchronization takes place before the next kernel is launched (see the sketch at the end of this reply). But I don’t see how that would cause a kernel not to be launched at all. Are you sure your last kernel launch is not conditional?

I have also observed some other minor differences, probably related to differences in block launch overhead and memory write-back time… Hope somebody else can clear this up :)
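If you want to check whether the blocking behaviour is really what differs, you can reproduce it in a normal run: either export CUDA_LAUNCH_BLOCKING=1 before starting the program, or synchronize explicitly after each launch. A minimal sketch, with placeholder stub kernels and made-up launch configurations standing in for your real ones:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the real kernels (prepObj_2, insideLoop, ...).
__global__ void prepObj_2_stub() {}
__global__ void insideLoop_stub() {}

int main()
{
    // Mimic what cuda-gdb / the profiler effectively do: wait for each
    // kernel to finish before launching the next one.
    prepObj_2_stub<<<dim3(60, 60, 1), dim3(16, 16, 1)>>>();
    cudaError_t err = cudaDeviceSynchronize();   // blocks until prepObj_2_stub is done
    printf("after prepObj_2: %s\n", cudaGetErrorString(err));

    insideLoop_stub<<<1, 1>>>();                 // launch configuration here is made up
    err = cudaDeviceSynchronize();               // blocks until insideLoop_stub is done
    printf("after insideLoop: %s\n", cudaGetErrorString(err));

    return 0;
}

Running the unmodified binary with CUDA_LAUNCH_BLOCKING=1 set in the environment has a similar effect without touching the code, and is probably the closest match to what the debugger/profiler do.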

Are you checking the return code of all CUDA calls for errors?

It could be that the launch of the last kernel is failing due to some resource constraints.
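As a rough sketch of what that checking could look like (the CUDA_CHECK macro and the insideLoop_stub kernel below are placeholders, not taken from the original code): wrap every runtime call, and query cudaGetLastError() right after the launch as well as after a synchronization, since launches are asynchronous. A launch that fails for resource reasons typically shows up as an error such as cudaErrorLaunchOutOfResources.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper; not from the original code.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void insideLoop_stub() {}   // stands in for the real insideLoop kernel

int main()
{
    insideLoop_stub<<<1, 1>>>();         // launch configuration is a placeholder
    // The launch itself returns immediately, so check for launch errors explicitly...
    CUDA_CHECK(cudaGetLastError());
    // ...and then for errors that only surface once the kernel has actually run.
    CUDA_CHECK(cudaDeviceSynchronize());
    return 0;
}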