Line-by-line profiling for a single-kernel CUDA program

Is it possible to do line-by-line run-time profiling with the current CUDA profiling tools? I really like KCachegrind, but it does not support CUDA.

I tested nvprof several years ago and was under the impression that it was not possible (only kernel-level profiling was available). Is that still the case, or did I not look hard enough?

If this is now possible, pointers to tutorials would be appreciated!

nvvp, the visual profiler, can give line-by-line statistics in some cases.

Instruction execution counts and stall information can be obtained on a line-by-line basis; this is true for both source lines and disassembly lines.

To get the best feature set here, I would suggest using CUDA 7.5RC.
Also, the features are, to a large extent, only fully functional on cc 5.2 GPUs at the moment, since they depend on new hardware features in the newer GPUs.

You can get more information about it on pp. 17-18 of the CUDA_Profiler_Users_Guide.pdf that ships with CUDA 7.5RC.

Thanks a lot! I just upgraded my CUDA from 7 to 7.5, and now I can use the Kernel Profile - PC Sampling feature and see hotspots at the source-line level.

Just trying to understand the output: nvvp points out that memory dependency is the leading bottleneck, accounting for 56% of the PC samples. However, the top offending lines that account for most of the memory dependencies involve only register operations. For example, the second-highest memory-dependent line is


where v is a register variable with four float members (like a float4). The corresponding assembly is:

LDL R4, [R3+0xc];
FADD.FTZ R4, R4, 1; # this line has high memory dependency
STL [R3+0xc], R4;

I am curious: how could a register operation cause a memory-dependency problem?

My kernel is heavy-weight, using about 80 registers. Could this be caused by register spilling? The GPU I profiled on is a GTX 980 Ti, which according to the nvvp report supports a maximum of 255 registers per thread.

Register spill loads/stores show up as (local) memory operations, so heavy register usage can indeed translate into memory traffic.
In the particular case whose assembly you show, the FADD instruction, which uses R4, will certainly be dependent on the load of R4 in the previous instruction. A load or store by itself never stalls (that I can think of). The stall arises when the result of a load is needed by a subsequent instruction.
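As a hypothetical illustration (this is not your kernel, just a pattern I am assuming for the sketch): a dynamically indexed per-thread aggregate cannot be kept in registers, so it is placed in local memory even without spilling. The increment then compiles to exactly the LDL/FADD/STL sequence you quoted, and the sampled stall lands on the FADD:

```cuda
// Hypothetical sketch: a per-thread struct indexed with a runtime value.
// The compiler cannot keep a dynamically indexed aggregate in registers,
// so "v" lives in local memory; the += then compiles to a local load
// (LDL), an FADD, and a local store (STL), and the FADD stalls until the
// loaded data arrives (a load-to-use dependency).
struct vec4 { float f[4]; };

__global__ void stall_demo(float *out, const int *idx)
{
    vec4 v = {{0.f, 0.f, 0.f, 0.f}};
    int i = idx[threadIdx.x] & 3;  // runtime index defeats registerization
    v.f[i] += 1.0f;                // -> LDL ... FADD ... STL on local memory
    out[threadIdx.x] = v.f[0] + v.f[1] + v.f[2] + v.f[3];
}
```

You can confirm where the stall is charged by looking at this pattern's SASS (cuobjdump -sass) next to the PC-sampling view.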

After playing with --ptxas-options=-v, I found there is no register spilling in my kernel.

The ptxas info for -arch=sm_20:

ptxas info    : 0 bytes gmem, 18728 bytes cmem[2]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_20'
ptxas info    : Function properties for _Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_
    224 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 63 registers, 136 bytes cmem[0], 144 bytes cmem[16]

For -arch=sm_52:

ptxas info    : 0 bytes gmem, 18728 bytes cmem[3]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_52'
ptxas info    : Function properties for _Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_
    208 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 82 registers, 424 bytes cmem[0], 140 bytes cmem[2]

For the FADD instruction, I imagined that reading a register operand is instant (takes one clock?). Is that incorrect?

This is a load of R4 from (local) memory:

LDL R4, [R3+0xc];

This is the next instruction:

FADD.FTZ R4, R4, 1;

The above instruction will stall because it is dependent on R4 being retrieved from memory. Once the read transaction issued by the LDL actually completes and R4 has a valid value, the next instruction can begin.

Therefore I am not surprised that the profiler reports a high memory dependency on the FADD instruction. The LDL itself issues quickly, but that does not mean R4 is populated yet; it only starts the read transaction. So the FADD gets stalled waiting for R4 to actually be populated with a valid value, and the stall is charged to the FADD.
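For contrast, a hedged sketch of the fully registerized case (assumed kernel names, not from your code): when nothing forces the value into local memory, the same source-level increment becomes a single register-to-register FADD with no LDL to wait on, so that line would show no memory dependency in the profiler:

```cuda
// Hypothetical sketch: when the compiler can keep the value in a
// register (no dynamic indexing, no address taken, no spilling), the
// increment is one FADD on register operands with no memory dependency.
__global__ void no_stall_demo(float4 *out)
{
    float4 v = make_float4(0.f, 0.f, 0.f, 0.f);
    v.w += 1.0f;              // pure register FADD; no LDL/STL generated
    out[threadIdx.x] = v;
}
```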

I’m not sure I can make this any clearer.

Thanks again for your reply. Although I am not familiar with how GPU assembly is executed, I do see your argument.

Nonetheless, I am still not sure why this particular line got picked; it looks simple and innocent to me, and there are plenty of more complex statements throughout the kernel.

What makes this statement so special? This line is located at