Line-by-line profiling for a single-kernel CUDA program

FangQ · August 17, 2015, 7:30pm

Is it possible to do line-by-line run-time profiling with the current CUDA profiling tools? I really like kcachegrind, but it does not support CUDA.

I tested nvprof several years ago and was under the impression that it was not possible (only kernel-level profiling was available). Is that still the case, or did I not look hard enough?

If this is now possible, pointers to tutorials would be appreciated!

Robert_Crovella · August 17, 2015, 11:31pm

nvvp, the visual profiler, can give line-by-line statistics in some cases.

Instruction execution counts and stall information can be obtained on a line-by-line basis, for both source lines and disassembly lines.

To get the best feature set here, I would suggest using CUDA 7.5RC.
Also, the features are to a large extent only fully functional on cc5.2 GPUs (currently), since they depend on new hardware features in the newer GPUs.

You can get more information about it on pp. 17-18 of the CUDA_Profiler_Users_Guide.pdf that ships with CUDA 7.5RC.

FangQ · August 18, 2015, 6:04pm

Thanks a lot! I just upgraded CUDA from 7 to 7.5, and now I can use the Kernel Profile - PC Sampling feature and see hotspots at the source-line level.

Just trying to understand the output: nvvp points out that memory dependency is the leading bottleneck, accounting for 56% of the PC samples. However, the top offending lines that account for most of the memory dependencies are only register operations. For example, the line with the second-highest memory dependency is

v.nscat++;

where v is a register variable with 4 float members (like a float4). The corresponding assembly is

LDL R4, [R3+0xc];
FADD.FTZ R4, R4, 1; # this line has high memory dependency
STL [R3+0xc], R4;
SYNC;

I am curious: how could a register operation cause a memory dependency problem?

My kernel is heavyweight, using about 80 registers. Could this be caused by register spilling? The GPU I profiled on is a GTX 980 Ti, which supports at most 255 registers per thread according to the nvvp report.

Robert_Crovella · August 18, 2015, 6:28pm

Register spill loads/stores could trigger memory operations on register usage.
In the particular case you show the assembly of, the FADD instruction, using R4, will certainly be dependent on the load of R4 in the previous instruction. A load or store by itself never causes a stall (that I can think of). A stall arises when the results of a load are needed by a subsequent instruction.
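
A minimal sketch of this load-then-use pattern, in case it helps. The kernel `add_one` and the host helper `run_add_one` are hypothetical names made up for illustration; they are not taken from the kernel being profiled, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): one load followed by one dependent add.
__global__ void add_one(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];    // the load is issued here; the warp does not wait yet
        out[i] = x + 1.0f;  // first use of x: the warp waits here until the load returns
    }
}

// Hypothetical host helper: copies `in` to the GPU, runs add_one, returns the result.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_add_one(const std::vector<float> &in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) return out;

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    add_one<<<blocks, threads>>>(d_in, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling `run_add_one({0.0f, 1.0f, 2.5f})` from a host `main` returns `{1, 2, 3.5}` on a working CUDA device. If the file is built with `nvcc -lineinfo`, the profiler can map samples back to source lines, and the waiting time would be expected to show up on the `x + 1.0f` line rather than on the load.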

FangQ · August 18, 2015, 9:17pm

After playing with --ptxas-options=-v, I found that there is no register spilling in my kernel.

The ptxas info for -arch=sm_20:

ptxas info    : 0 bytes gmem, 18728 bytes cmem[2]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_20'
ptxas info    : Function properties for _Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_
    224 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 63 registers, 136 bytes cmem[0], 144 bytes cmem[16]

For -arch=sm_52:

ptxas info    : 0 bytes gmem, 18728 bytes cmem[3]
ptxas info    : Compiling entry function '_Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_' for 'sm_52'
ptxas info    : Function properties for _Z13mcx_main_loopPhPfS0_PjP6float4S3_S3_S0_S1_S0_S0_S0_S0_
    208 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 82 registers, 424 bytes cmem[0], 140 bytes cmem[2]

For the FADD instruction, I imagined that loading a register variable is instant (takes one clock?). Is that incorrect?

Robert_Crovella · August 18, 2015, 9:48pm

This is a load of R4 from (local) memory:

LDL R4, [R3+0xc];

This is the next instruction:

FADD.FTZ R4, R4, 1;

The above instruction will stall because it is dependent on R4 being retrieved from memory. Once the read transaction issued by the LDL actually completes, and R4 has a valid value, then the next instruction can begin.

Therefore I am not surprised that the profiler would report that the FADD instruction has a high memory dependency. The LDL instruction itself is issued quickly, but that does not mean R4 is populated yet, so the FADD gets stalled waiting for R4 to actually get populated with a valid value.
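
As a sketch of how a variable that looks like a register variable in the source can end up behind LDL/STL without any register spilling: registers cannot be indexed with a run-time value, so the compiler may place such a variable in local memory (the "stack frame" in the ptxas output). Whether this is what happens to `v` in the kernel above is an assumption; the thread does not confirm it. The struct `Counters`, the kernel `bump`, and the helper `run_bump` are hypothetical names, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-thread record with 4 floats (illustration only).
struct Counters { float f[4]; };

// Hypothetical kernel: increments one member of a per-thread record,
// where the member is selected by a value only known at run time.
__global__ void bump(const int *slot, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Counters v = {{0.f, 0.f, 0.f, 0.f}};

    // Run-time index: the compiler may keep v in local memory, so this
    // increment can become a local load, an add, and a local store.
    v.f[slot[tid] & 3] += 1.0f;

    // Encode which member was incremented: member k yields k + 1.
    out[tid] = v.f[0] + 2.f * v.f[1] + 3.f * v.f[2] + 4.f * v.f[3];
}

// Hypothetical host helper: runs bump on `slots` and returns the per-thread results.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_bump(const std::vector<int> &slots)
{
    int n = static_cast<int>(slots.size());
    std::vector<float> out(n, 0.0f);
    if (n == 0) return out;

    int *d_slot = nullptr;
    float *d_out = nullptr;
    cudaMalloc(&d_slot, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_slot, slots.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    bump<<<blocks, threads>>>(d_slot, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_slot);
    cudaFree(d_out);
    return out;
}
```

Compiling a file like this with `--ptxas-options=-v` is one way to check: a non-zero stack frame together with 0 bytes of spill stores and loads would indicate local memory use that is not caused by spilling. The compiler is free to keep such a small record in registers instead, so the result can differ between architectures and CUDA versions.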

I’m not sure I can make this any clearer.

FangQ · August 19, 2015, 4:51am

Thanks again for your reply. Although I am not familiar with how GPU assembly is executed, I do see your argument.

Nonetheless, I am still not sure why this particular line got picked: it looks simple and innocent to me, and there are plenty of more complex statements throughout the kernel.

What makes this statement so special? This line is located at

https://github.com/fangq/mcx/blob/master/src/mcx_core.cu#L595

      
    } else if (gcfg->mediaformat == MEDIA_2LABEL_MIX) { //< [s1][c1][c0]: s1: (volume fraction of tissue 1)*(2^16-1), c1: tissue 1 label, c0: tissue 0 label
        union {
            unsigned int   i;
            unsigned short h[2];
            unsigned char  c[4];
        } val;
        val.i = mediaid & MED_MASK;

        if (val.h[1] > 0) {
            if ((rand_uniform01(t) * 32767.f) < val.h[1]) {
                *((float4*)(prop)) = gproperty[val.c[1]];
                mediaid >>= 8;
            } else {
                *((float4*)(prop)) = gproperty[val.c[0]];
            }

            mediaid &= 0xFFFF;
        } else {
            *((float4*)(prop)) = gproperty[val.c[0]];
        }
    } else if (gcfg->mediaformat == MEDIA_ASGN_BYTE) { //< [c3][c2][c1][c0]: c0/c1/c2/c3: interpolation ratios (scaled to 0-255) of mua/mus/g/n between cfg.prop(1,:) and cfg.prop(2,:)
