Correlation between instructions in ptx and cudaProf


I am trying to find out how cudaProf populates the Instructions field. I know it shows the number of instructions executed per warp per SM.
I believe that it takes the instructions from the PTX code, but if I count the number of instructions in the PTX code and compare with cudaProf, they do not match.

Sometimes there are more instructions in the PTX and sometimes fewer. Does anyone know whether there is any correlation between the number of instructions in PTX and the number shown in cudaProf? Or, where does cudaProf count the instructions from?
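For reference, this is roughly how I count the PTX instructions — a quick script of mine, assuming that every semicolon-terminated line outside of directives, labels, braces, and comments is one instruction:

```python
def count_ptx_instructions(ptx: str) -> int:
    """Count executable PTX instructions: semicolon-terminated lines
    that are not directives (.loc, .reg, ...), labels, braces, or comments."""
    count = 0
    for line in ptx.splitlines():
        line = line.strip()
        # skip blanks, comments, directives, and braces
        if not line or line.startswith(('//', '.', '{', '}')):
            continue
        # skip labels
        if line.endswith(':'):
            continue
        if line.endswith(';'):
            count += 1
    return count

sample = """\
    .loc 15 10 0
    mov.s32      %r1, 110;
    ld.param.u64 %rd1, [__cudaparm__Z6kernelPi_d_arr];
    cvt.u32.u16  %r2, %tid.x;
"""
print(count_ptx_instructions(sample))  # 3
```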


There is only a loose, imprecise relationship between PTX instructions and the events reported by cudaProf.

This is because PTX is an intermediate virtual instruction set that is compiled down into the GPU’s native instruction set before being executed. This compilation process will change the organization, number, and type of instructions that are actually executed on the GPU and reported by cudaProf.

Then can you provide some insight into where cudaProf gets the information to produce these numbers?

I am under the impression that a certain number of hardware performance counters exist and that cudaProf reads these after a kernel has executed. These performance counters are exposed in PTX via the pmevent instruction and some special registers, but only the process of reading the 32-bit and 64-bit cycle counters is currently documented in the PTX spec. There is definitely enough information in the spec to reverse engineer the other counters (assuming that ptxas will actually generate code for them if you use them in PTX), but no one has done so yet. There is also the possibility that NVIDIA may add better documentation for them in PTX if people ask about it; they have been pretty good at updating the spec when I have asked about things in the past.

I have managed to predict all the cudaProf counters except Instructions by looking at the PTX code. The Instructions counter eludes me: sometimes I overpredict and sometimes I underpredict. Hence, I am looking for a way to predict the Instructions counter specifically. I hope the NVIDIA folks document it somewhere.

You could use something like decuda to disassemble the actual hardware instructions, and then count them. That might be a better predictor. PTX (as said) is just an intermediate language; you cannot determine the hardware instruction count from it.

I used decuda on a very simple kernel.

The PTX code is


		.loc	15	10	0
		mov.s32		%r1, 110;
		ld.param.u64	%rd1, [__cudaparm__Z6kernelPi_d_arr];
		cvt.u32.u16	%r2, %tid.x;
		mov.u16		%rh1, %ctaid.x;
		mov.u16		%rh2, %ntid.x;
		mul.wide.u16	%r3, %rh1, %rh2;
		add.u32		%r4, %r2, %r3;
		cvt.u64.s32	%rd2, %r4;
		mul.lo.u64	%rd3, %rd2, 4;
		add.u64		%rd4, %rd1, %rd3;
		st.global.s32	[%rd4+0], %r1;
		.loc	15	19	0
		} // _Z6kernelPi

and the output of decuda is:

// Disassembling _Z6kernelPi
.entry _Z6kernelPi
.lmem 0
.smem 24
.reg 2
.bar 0
mov.b16 $r0.hi, %ntid.y
cvt.rn.u32.u16 $r1, $r0.lo
mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
shl.u32 $r0, $r0, 0x00000002
mov.b32 $r1, 0x0000006e
add.u32 $r0, s[0x0010], $r0
mov.end.u32 g[$r0], $r1

cudaProf reports 8 instructions for this kernel, but decuda shows only 7. Why is there this discrepancy? Moreover, there are no load/store instructions in the decuda output.

The kernel has been compiled such that arguments are passed in shared memory. There’s no explicit parameter load from shared memory because the compiler knows where in shmem that argument lies, and most of the arithmetic instructions let you have one operand in shared memory. It’s used as the first input operand to the mad24 instruction:

mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1

The global store at the end is represented as a mov from a register to global memory:

mov.end.u32 g[$r0], $r1

But even then the number of instructions is not 8 as shown in cudaProf. Should any of the instructions in the decuda output be counted as 2?

One more question: does cudaProf count the load/store instructions as well when it reports the number of instructions, or does it neglect them because they are accounted for by the gld/gst counters?

Thanks again.

I found the number of instructions reported by CudaProf to be slightly different from the numbers I get from simulation. The discrepancies are usually below 1%. The larger ones are easily explained by the fact that CudaProf gathers statistics on one SM or TPC only, so it is affected by load imbalance.
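To illustrate the load-imbalance effect, here is a toy sketch (not cudaProf's actual implementation — the SM count, block count, and per-block instruction count are all made-up numbers): if the counter is read on one SM and scaled up, and the blocks don't divide evenly across SMs, the extrapolated total drifts from the true total:

```python
NUM_SMS = 16             # hypothetical number of SMs
INSTR_PER_BLOCK = 100    # hypothetical instructions executed per block
num_blocks = 100         # not a multiple of NUM_SMS -> load imbalance

# assign blocks to SMs round-robin and tally instructions per SM
per_sm = [0] * NUM_SMS
for b in range(num_blocks):
    per_sm[b % NUM_SMS] += INSTR_PER_BLOCK

true_total = num_blocks * INSTR_PER_BLOCK
# profiler-style estimate: counter read on SM 0 only, scaled by SM count
extrapolated = per_sm[0] * NUM_SMS

print(true_total, extrapolated)  # 10000 11200
```

SM 0 receives 7 of the 100 blocks here instead of 100/16 = 6.25, so the scaled-up estimate overshoots by 12%; with many more blocks per SM the relative error shrinks toward the sub-1% range mentioned above.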

As for the remaining differences, blame the Heisenberg principle. ;)

I suspect this is caused by the profiler itself. Did you try disabling all counters except ‘instructions’ in CudaProf?

Having no idea of the internal implementation, I would expect an implicit exit instruction to hide in there as well…