Correlation between instructions in ptx and cudaProf

mayank · June 23, 2010, 3:04am

Hi,

I am trying to find out how cudaProf populates the instructions field. I know it shows the number of instructions executed per warp per SM.
I believe that it takes the instructions from the PTX code. But if I count the instructions in the PTX code and compare that with the value cudaProf reports, the two do not match.

Sometimes the PTX count is higher and sometimes it is lower. Does anyone know if there is any correlation between the number of instructions in PTX and the number shown in cudaProf? Or, where does cudaProf get its instruction count from?

Thanks

Gregory_Diamos · June 23, 2010, 7:07am

There should only be a loose, imprecise relationship between PTX instructions and the events reported by cudaProf.

This is because PTX is an intermediate virtual instruction set that is compiled down into the GPU’s native instruction set before being executed. This compilation process will change the organization, number, and type of instructions that are actually executed on the GPU and reported by cudaProf.

mayank · June 24, 2010, 12:11am

Then can you provide some insight into where cudaProf gets the information to produce its numbers?

Gregory_Diamos · June 24, 2010, 7:59am

I am under the impression that a certain number of hardware performance counters exist and that cudaProf reads them after a kernel has executed. These performance counters are exposed in PTX via the pmevent instruction and some special registers, but only the process of reading the 32-bit and 64-bit cycle counters is currently documented in the PTX spec. There is definitely enough information in the spec to reverse engineer the other counters (assuming that ptxas will actually generate code for them if you use them in PTX), but no one has done so yet. There is also the possibility that NVIDIA may add better documentation for them in PTX if people ask about them, as they have been pretty good at updating the spec when I have asked about things in the past.

mayank · June 24, 2010, 8:46pm

I have managed to predict every cudaProf counter except instructions by looking at the PTX code. The instructions counter eludes me: sometimes I overpredict and sometimes I underpredict. Hence, I am looking for a way to predict the instructions counter specifically. I hope the NVIDIA folks document it somewhere.

wumpus · June 27, 2010, 1:26pm

You could use something like decuda to disassemble the actual hardware instructions, and then count them. That might be a better predictor. PTX (as said) is just an intermediate language; you cannot determine any hardware instruction count from it.

mayank · June 28, 2010, 6:32pm

I used decuda on a very simple kernel.

The PTX code is

$LBB1__Z6kernelPi:
	.loc	15	10	0
	mov.s32		%r1, 110;
	ld.param.u64	%rd1, [__cudaparm__Z6kernelPi_d_arr];
	cvt.u32.u16	%r2, %tid.x;
	mov.u16		%rh1, %ctaid.x;
	mov.u16		%rh2, %ntid.x;
	mul.wide.u16	%r3, %rh1, %rh2;
	add.u32		%r4, %r2, %r3;
	cvt.u64.s32	%rd2, %r4;
	mul.lo.u64	%rd3, %rd2, 4;
	add.u64		%rd4, %rd1, %rd3;
	st.global.s32	[%rd4+0], %r1;
	.loc	15	19	0
	exit;
$LDWend__Z6kernelPi:
	} // _Z6kernelPi

and the output of decuda is:

// Disassembling _Z6kernelPi
.entry _Z6kernelPi
{
.lmem 0
.smem 24
.reg 2
.bar 0
mov.b16 $r0.hi, %ntid.y
cvt.rn.u32.u16 $r1, $r0.lo
mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
shl.u32 $r0, $r0, 0x00000002
mov.b32 $r1, 0x0000006e
add.u32 $r0, s[0x0010], $r0
mov.end.u32 g[$r0], $r1
}

cudaProf reports 8 instructions for this kernel, but decuda shows only 7. Why is there a discrepancy? Moreover, there are no load/store instructions in the decuda output.
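For anyone who wants to repeat the static comparison, here is a minimal Python sketch that counts instruction lines in the two listings from this post. The function name `count_instructions` and the filtering rules (skip labels, directives that start with a dot, comments, and braces) are assumptions made for illustration; this is a rough line filter, not a real PTX or decuda parser.

```python
PTX_LISTING = """
$LBB1__Z6kernelPi:
    .loc 15 10 0
    mov.s32 %r1, 110;
    ld.param.u64 %rd1, [__cudaparm__Z6kernelPi_d_arr];
    cvt.u32.u16 %r2, %tid.x;
    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r3, %rh1, %rh2;
    add.u32 %r4, %r2, %r3;
    cvt.u64.s32 %rd2, %r4;
    mul.lo.u64 %rd3, %rd2, 4;
    add.u64 %rd4, %rd1, %rd3;
    st.global.s32 [%rd4+0], %r1;
    .loc 15 19 0
    exit;
$LDWend__Z6kernelPi:
    } // _Z6kernelPi
"""

DECUDA_LISTING = """
// Disassembling _Z6kernelPi
.entry _Z6kernelPi
{
.lmem 0
.smem 24
.reg 2
.bar 0
mov.b16 $r0.hi, %ntid.y
cvt.rn.u32.u16 $r1, $r0.lo
mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1
shl.u32 $r0, $r0, 0x00000002
mov.b32 $r1, 0x0000006e
add.u32 $r0, s[0x0010], $r0
mov.end.u32 g[$r0], $r1
}
"""


def count_instructions(listing: str) -> int:
    """Count lines that look like instructions (heuristic, see assumptions above)."""
    count = 0
    for raw in listing.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing // comments
        if not line or line in ("{", "}"):    # blank lines and braces
            continue
        if line.startswith("."):              # directives: .loc, .entry, .reg, ...
            continue
        if line.endswith(":"):                # labels
            continue
        count += 1
    return count


if __name__ == "__main__":
    print("PTX instructions:   ", count_instructions(PTX_LISTING))
    print("decuda instructions:", count_instructions(DECUDA_LISTING))
```

On these two listings the sketch counts 12 PTX instructions and 7 decuda instructions, so neither static count equals the 8 that cudaProf reports.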

ihaque · June 28, 2010, 7:13pm

mayank wrote: "The number of instructions in cudaProf is 8 for this kernel. Decuda shows only 7 instructions. Why is there this discrepancy? Moreover, there are no load/store instructions in the decuda output."

The kernel has been compiled such that arguments are passed in shared memory. There’s no explicit parameter load from shared memory because the compiler knows where in shmem that argument lies, and most of the arithmetic instructions let you have one operand in shared memory. It’s used as the first input operand to the mad24 instruction:

mad24.lo.u32.u16.u16.u32 $r0, s[0x000c], $r0.hi, $r1

The global store at the end is represented as a mov from a register to global memory:

mov.end.u32 g[$r0], $r1
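To see how the shorter native sequence can cover the same work as the PTX, here is a rough Python sketch that emulates the address arithmetic of both listings. The operand roles are guesses made by matching the disassembly against the PTX and are not confirmed anywhere in this thread: `$r0.lo` is taken to hold the thread index, `$r0.hi` the block dimension, `s[0x000c]` the block index, and `s[0x0010]` the array pointer. The function names and the sample input values are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def ptx_address(d_arr: int, tid_x: int, ctaid_x: int, ntid_x: int) -> int:
    """Follow the PTX listing step by step."""
    r3 = ctaid_x * ntid_x          # mul.wide.u16  %r3, %rh1, %rh2
    r4 = (tid_x + r3) & MASK32     # add.u32       %r4, %r2, %r3
    rd3 = r4 * 4                   # cvt.u64.s32 + mul.lo.u64 (small positive index assumed)
    return d_arr + rd3             # add.u64       %rd4, %rd1, %rd3


def native_address(d_arr: int, tid_x: int, ctaid_x: int, ntid_x: int) -> int:
    """Follow the decuda listing under the operand assumptions stated above."""
    r1 = tid_x                               # cvt.rn.u32.u16 $r1, $r0.lo
    r0 = (ctaid_x * ntid_x + r1) & MASK32    # mad24: multiply and add in one instruction
    r0 = (r0 << 2) & MASK32                  # shl.u32 by 2, i.e. multiply by 4
    return (d_arr + r0) & MASK32             # add.u32 with the shared-memory operand


if __name__ == "__main__":
    # Hypothetical launch values, chosen only for illustration.
    args = dict(d_arr=0x1000, tid_x=5, ctaid_x=3, ntid_x=64)
    print(hex(ptx_address(**args)), hex(native_address(**args)))
```

Under these assumptions the multiply and add collapse into one mad24, the multiply by 4 becomes a shift, and the explicit parameter load disappears into a shared-memory operand, which is consistent with the native listing being shorter than the PTX.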

mayank · June 28, 2010, 7:24pm

But even then the number of instructions is not 8 as shown in cudaProf. Should any of the instructions in the decuda output be counted as 2?

One more question: does cudaProf also count load/store instructions when it reports the number of instructions, or does it leave them out because they are covered by the gld/gst counters?

Thanks again.

Sylvain_Collange · June 28, 2010, 8:14pm

I found the number of instructions reported by CudaProf to be slightly different than the numbers I get from simulation. The discrepancies are usually lower than 1%. The other cases are easily explained by the fact that CudaProf gathers statistics on one SM or TPC only, so it is affected by load imbalance.

As for the remaining differences, blame the Heisenberg principle. ;)

I suspect this is caused by the profiler itself. Did you try disabling all counters except ‘instructions’ in CudaProf?
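The load-imbalance effect mentioned above can be illustrated with a small Python sketch. It assumes blocks are dealt out to SMs round-robin and that every warp executes the same number of instructions; neither assumption is documented behaviour of the hardware or of CudaProf, and the function names and launch sizes are hypothetical.

```python
def instructions_on_sm(num_blocks: int, warps_per_block: int,
                       instr_per_warp: int, num_sms: int,
                       profiled_sm: int = 0) -> int:
    """Warp-level instructions seen on one SM under round-robin block placement."""
    # Blocks profiled_sm, profiled_sm + num_sms, ... land on the profiled SM.
    blocks_here = len(range(profiled_sm, num_blocks, num_sms))
    return blocks_here * warps_per_block * instr_per_warp


def true_total(num_blocks: int, warps_per_block: int, instr_per_warp: int) -> int:
    """Warp-level instructions executed across the whole grid."""
    return num_blocks * warps_per_block * instr_per_warp


if __name__ == "__main__":
    # Hypothetical launch: 100 blocks of 4 warps, 8 instructions per warp, 30 SMs.
    for sm in (0, 10):
        seen = instructions_on_sm(100, 4, 8, 30, profiled_sm=sm)
        print(f"SM {sm}: counter = {seen}, extrapolated total = {seen * 30}")
    print("true total =", true_total(100, 4, 8))
```

In this model SM 0 receives 4 blocks and SM 10 receives 3, so scaling the single-SM counter by the number of SMs gives 3840 or 2880 against a true total of 3200. Which SM is sampled changes the reported figure even though the kernel is identical.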

tera · June 28, 2010, 11:28pm

Having no idea of the internal implementation, I would expect an implicit exit instruction to hide in there as well…
