Hello,
While debugging my CUDA code, I ran into a situation where the last kernel in my code is launched when I run the program directly, but when I run the same code under cuda-gdb or with the CUDA Profiler enabled, that kernel does not show up as being launched.
Below is the output from both cuda-gdb and the CUDA Profiler log file:
cuda-gdb:
+++++++++++++++
[Context Create of context 0x165d97f0 on Device 0]
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
[Launch of CUDA Kernel 0 (prepStruct<<<(1,1,1),(34,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (angV<<<(1,1,1),(72,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 2 (fillUserValues<<<(4,1,1),(256,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 3 (prepObj_1<<<(60,60,1),(16,16,1)>>>) on Device 0]
[Launch of CUDA Kernel 4 (prepObj_2<<<(60,60,1),(16,16,1)>>>) on Device 0]
[Termination of CUDA Kernel 3 (prepObj_1<<<(60,60,1),(16,16,1)>>>) on Device 0]
couter_SectorArea: nan
couter_SectorAreaRatio: nan
Program exited normally.
(cuda-gdb)
cuda profiler output file “cuda_profile_0.log”:
++++++++++++++
CUDA_PROFILE_LOG_VERSION 2.0
CUDA_DEVICE 0 Tesla M2070
TIMESTAMPFACTOR fffff6c88cc14af0
method,gputime,cputime,occupancy
method=[ memset32_aligned1D ] gputime=[ 4.736 ] cputime=[ 22.000 ] occupancy=[ 1.000 ]
method=[ memcpyHtoD ] gputime=[ 2.048 ] cputime=[ 41.000 ]
method=[ memcpyHtoD ] gputime=[ 1.632 ] cputime=[ 30.000 ]
method=[ memcpyHtoD ] gputime=[ 1.632 ] cputime=[ 29.000 ]
method=[ memcpyHtoD ] gputime=[ 429638.562 ] cputime=[ 429958.000 ]
method=[ _Z13prepStructP10cfg_struct ] gputime=[ 4.288 ] cputime=[ 10.000 ] occupancy=[ 0.042 ]
method=[ memcpyDtoH ] gputime=[ 2.304 ] cputime=[ 36.000 ]
method=[ _Z9angVPdP13SecAng_struct ] gputime=[ 2.144 ] cputime=[ 6.000 ] occupancy=[ 0.062 ]
method=[ memcpyDtoH ] gputime=[ 2.272 ] cputime=[ 36.000 ]
method=[ _Z14fillUserValuesP10obj_structPdS1_ ] gputime=[ 3.936 ] cputime=[ 6.000 ] occupancy=[ 0.667 ]
method=[ _Z9prepObj_1P10obj_structPdiS1_iP13SecAng_struct ] gputime=[ 4036.096 ] cputime=[ 540.000 ] occupancy=[ 0.500 ]
method=[ _Z14prepObj_2P10obj_structPdiS1_iP13SecAng_struct ] gputime=[ 1733.312 ] cputime=[ 8.000 ] occupancy=[ 1.000 ]
method=[ memcpyDtoH ] gputime=[ 7242.144 ] cputime=[ 13732.000 ]
method=[ memcpyDtoH ] gputime=[ 136024.609 ] cputime=[ 136659.000 ]
method=[ memcpyDtoH ] gputime=[ 268721.844 ] cputime=[ 269382.000 ]
method=[ memcpyDtoH ] gputime=[ 264041.531 ] cputime=[ 264705.000 ]
method=[ memcpyDtoH ] gputime=[ 263697.062 ] cputime=[ 264343.000 ]
method=[ memcpyDtoH ] gputime=[ 261865.031 ] cputime=[ 262520.031 ]
The kernel after prepObj_2 is insideLoop (which is also the last kernel in my code), and I verified that it is called when I run my CUDA code directly (I added printf statements before and after the call to the "insideLoop" kernel).
Both the cuda-gdb output and the CUDA Profiler log file end before this last kernel. In fact, the last event cuda-gdb reports is "Termination of CUDA Kernel 3".
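For reference, the check around the insideLoop launch is roughly like the sketch below. The kernel signature, argument, and launch configuration here are placeholders, not the real ones from my code; the error-check and synchronize calls are the standard CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder signature -- the real insideLoop kernel takes different arguments.
__global__ void insideLoop(double *data) { /* ... */ }

void runLastKernel(double *d_data)
{
    printf("before insideLoop launch\n");

    // Placeholder launch configuration, not the real grid/block dimensions.
    insideLoop<<<dim3(60, 60, 1), dim3(16, 16, 1)>>>(d_data);

    // Kernel launches are asynchronous, so check the launch status and then
    // wait for the kernel to finish before trusting the printf below.
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t syncErr   = cudaDeviceSynchronize();

    printf("after insideLoop launch: launch=%s sync=%s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
}

int main()
{
    double *d_data = 0;
    cudaMalloc(&d_data, 60 * 16 * 60 * 16 * sizeof(double));
    runLastKernel(d_data);
    cudaFree(d_data);
    return 0;
}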
My question:
Is there a difference in how multiple kernels are scheduled, or in what gets recorded in the CUDA Profiler log file "cuda_profile_0.log", when running under cuda-gdb or with the profiler enabled vs. running the program directly?
System information:
Toolkit 4.0
M2070 on PCIe card on x86 system
CentOS 5.6, 64 bit
Thanks,
Nikhil