I’m interested in knowing more details about the “instructions” field in the CUDA profiler output.
The CUDA profiler manual states:
This option records the instruction count for a given kernel.
I’ve observed that the value of instructions as recorded by the profiler changes (although by very small amounts) across different invocations of the very same kernel, without recompilation. For example, running the same m-m multiply kernel twice gives the following profiler output:
Instance 1: occupancy=[ 0.667 ] branch=[ 200704 ] instructions=[ 4487178 ]
Instance 2: occupancy=[ 0.667 ] branch=[ 200704 ] instructions=[ 4487394 ]
As you can see, the number of branches has remained unchanged across the two invocations, but the instruction count has changed. Obviously, the number of instructions generated by the compiler remains unchanged, and I guess so does the scheduling of thread-blocks.
Given this, why does the instruction count recorded by the profiler change? And, more fundamentally: what exactly is the instruction count recorded by the profiler, and how does it relate to the number of instructions generated by the compiler per thread (which can be statically determined, for example, by examining the cubin output)?
I have also observed that GPU occupancy seems to influence the instruction count. I ran an experiment in which I kept the kernel code unchanged between two invocations, but changed only the shared memory requirements, such that three blocks could run concurrently in the first instance and only one block could run in the second instance. In this case, the instruction count changed significantly between the two invocations.
Instance 1: occupancy=[ 1.000 ] branch=[ 131072 ] instructions=[ 331257 ] // max concurrency = 3 blocks
Instance 2: occupancy=[ 0.333 ] branch=[ 131072 ] instructions=[ 341866 ] // max concurrency = 1 block
You can see here that the instruction count has increased significantly when the occupancy is reduced. So how does occupancy relate to the instruction count?
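For reference, one way to set up such an occupancy experiment is to pad the kernel's static shared-memory allocation so fewer blocks fit on a multiprocessor. This is a hypothetical sketch, not the exact kernel used above: `TILE`, `PAD`, and `matmul_tile` are my own names, and the sizes assume a part with 16 KB of shared memory per MP.

```cuda
#define TILE 16

// Tiled matrix-matrix multiply for square n x n matrices (n a multiple of TILE).
// PAD is an assumed compile-time knob: it inflates the per-block shared-memory
// footprint without changing the kernel's work, which in turn limits how many
// blocks can be resident on one multiprocessor.
template <int PAD>
__global__ void matmul_tile(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    __shared__ float pad[PAD];           // unused; only inflates the footprint
    if (threadIdx.x == 0) pad[0] = 0.f;  // touch it so it is not optimized away

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.f;
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// matmul_tile<256>  : ~3 KB shared per block  -> several blocks per MP
// matmul_tile<2560> : ~12 KB shared per block -> one block per MP (16 KB part)
```

The tiles themselves take 2 x 16 x 16 x 4 = 2 KB, so the `PAD` values above are what push the footprint past the one-block and three-block limits on a 16 KB multiprocessor.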
Can someone who has insight into the working of the profiler please answer the above questions? Please excuse me for the lengthy post - I thought more details would help answer the question better.
I believe it is mentioned in the release notes of the profiler: the profiler profiles only one multiprocessor. So that is where the differences might be coming from.
I’m not sure I understand completely. I have read the release notes, and I understand that the profiler profiles only 1 MP. But how does that explain the differences in instruction counts across consecutive executions of the same kernel, and also the change in behavior with GPU occupancy? Could you please elaborate?
Hmm, I will try to check that tomorrow; it should be easy to do. That would also certainly be something to keep in the back of your mind (I have kernels where ‘random’ blocks need to return immediately).
Yes - what you say does make sense. But in the examples that I’ve cited above, the kernel that I was running was a simple matrix-matrix multiplication of 1024 X 1024 square matrices with thread blocks of size 16 X 16. That gives us a total of 4096 thread-blocks with 256 thread-blocks per MP. So, the extra-block scenario does not seem to arise here - unless the runtime scheduler schedules the blocks differently, of course.
And also, since this is m-m multiply, the body of the kernel has no conditionals or early-exit sequences.
Although it does not mention it explicitly, the CUDA profiler document seems to imply that the profiler data corresponds to all the blocks that run on the MP and not just a single block.
Here is what the document says - I’d like to know how you’d interpret it.
Are you accessing global memory from your kernel? If so, this may explain why the instruction count varies - latency may be slightly different between runs, or something like that. Lowering occupancy is likely to make this difference bigger, because some of the MPs will be underutilized.
Just a theory :)