I am trying to find out how cudaProf populates the Instructions field. I know it shows the number of instructions executed per warp per SM.
I believe it takes the instruction count from the PTX code. But if I count the number of instructions in the PTX code and compare against cudaProf, they do not match.
Sometimes PTX has more instructions, sometimes fewer. Does anyone know whether there is any correlation between the number of instructions in PTX and the count shown in cudaProf? Or, where does cudaProf count the instructions from?
There should only be a loose, imprecise relationship between PTX instructions and the events reported by cudaProf.
This is because PTX is an intermediate virtual instruction set that is compiled down into the GPU’s native instruction set before being executed. This compilation process will change the organization, number, and type of instructions that are actually executed on the GPU and reported by cudaProf.
I am under the impression that a certain number of hardware performance counters exist and that cudaProf reads these after a kernel has executed. These performance counters are exposed in PTX via the pmevent instruction and some special registers, but only the process of reading the 32-bit and 64-bit cycle counters is currently documented in the PTX spec. There is definitely enough information in the spec to reverse engineer the other counters (assuming that ptxas will actually generate code for them if you use them in PTX), but no one has done so yet. There is also the possibility that NVIDIA may add better documentation for them in PTX if people ask about them, as they have been pretty good at updating the spec when I have asked about things in the past.
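For reference, reading the two documented cycle counters from PTX is just a move from the corresponding special registers (a minimal sketch based on the %clock and %clock64 special registers described in the PTX ISA spec; register names here are illustrative):

```ptx
.reg .u32 %t32;
.reg .u64 %t64;

mov.u32 %t32, %clock;    // 32-bit cycle counter special register
mov.u64 %t64, %clock64;  // 64-bit cycle counter special register
```

The other performance counters are the ones that, as noted above, would have to be reverse engineered from the pmevent side of the spec.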
I have managed to predict all counters except Instructions in cudaProf by looking at the PTX code. The Instructions counter seems to elude me: sometimes I overpredict and sometimes I underpredict. Hence, I was looking for a way to predict the Instructions counter specifically in cudaProf. I hope the NVIDIA folks document it somewhere.
You could use something like decuda to disassemble the actual hardware instructions and then count them. That might be a better predictor. PTX (as said) is just an intermediate language; you cannot determine any hardware instruction count from it.
The number of instructions in cudaProf is 8 for this kernel, but decuda shows only 7 instructions. Why is there this discrepancy? Moreover, there are no load/store instructions in the decuda output.
The kernel has been compiled such that arguments are passed in shared memory. There's no explicit parameter load from shared memory because the compiler knows where in shmem that argument lies, and most of the arithmetic instructions let you have one operand in shared memory. It's used as the first input operand to the mad24 instruction.
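To make that concrete, here is a hypothetical kernel of the kind being discussed (the name and parameters are made up, not the poster's actual code). On sm_1x hardware the kernel parameters land in shared memory, and ptxas can use the parameter's shared-memory location directly as an operand of the mad24 instruction instead of emitting a separate load:

```cuda
// Illustrative only: 'scale', 'out', 'a', and 'b' are invented names.
__global__ void scale(int *out, int a, int b)
{
    // On sm_1x, 'a' and 'b' live in shared memory. The __mul24 plus
    // add below can compile to a single mad24, with the parameter's
    // shared-memory slot folded in as an operand -- so no explicit
    // load instruction appears in the disassembly.
    out[threadIdx.x] = __mul24(a, threadIdx.x) + b;
}
```

This is why a disassembly can show fewer instructions than you would expect from counting the loads in the PTX.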
But even then the number of instructions is not 8 as shown in cudaProf. Should any of the instructions in the decuda output be counted as 2?
One more question I had: does cudaProf count the load/store instructions when it reports the number of instructions, or does it neglect them because they are taken care of in the gld/gst counters?
I found the number of instructions reported by CudaProf to be slightly different from the numbers I get from simulation. The discrepancies are usually lower than 1%. The larger cases are easily explained by the fact that CudaProf gathers statistics on only one SM or TPC, so it is affected by load imbalance.
As for the remaining differences, blame the Heisenberg principle. ;)
I suspect this is caused by the profiler itself. Did you try disabling all counters except ‘instructions’ in CudaProf?