Profiler: Instruction count
Details about the instruction count field in the profiler output

Hi all,

I’m interested in knowing more details about the “instructions” field in the CUDA profiler output.

The CUDA profiler manual states:
This options records the instruction count for a given kernel.

I’ve observed that the instruction count recorded by the profiler changes, by very small amounts, across different invocations of the very same kernel, without recompilation. For example, running the same matrix-matrix multiply kernel twice gives the following profiler output:
Instance 1: occupancy=[ 0.667 ] branch=[ 200704 ] instructions=[ 4487178 ]
Instance 2: occupancy=[ 0.667 ] branch=[ 200704 ] instructions=[ 4487394 ]

As you can see, the number of branches is unchanged across the two invocations, but the instruction count differs by 216 (about 0.005%). Obviously, the number of instructions generated by the compiler is unchanged, and I assume the scheduling of thread blocks is too.
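To put a number on "very small", here is a quick Python sketch of the arithmetic on the two counts above. Nothing in it comes from the profiler itself; it only takes the difference of the two reported values, and the helper name `relative_change` is my own.

```python
def relative_change(before: int, after: int) -> float:
    """Return (after - before) / before as a fraction."""
    return (after - before) / before


# Instruction counts copied from the two profiler lines above.
run1, run2 = 4487178, 4487394

delta = run2 - run1
print(f"absolute difference: {delta} instructions")
print(f"relative difference: {relative_change(run1, run2):.4%}")
```

So the run-to-run variation is a few hundred instructions out of several million, while the branch count is identical.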

Given this, why does the instruction count recorded by the profiler change? That leads to a more basic question: what exactly does the profiler count, and how does it relate to the number of instructions the compiler generates per thread (which can be determined statically, for example by examining the cubin output)?

I have also observed that GPU occupancy seems to influence the instruction count. I ran an experiment in which I kept the kernel code unchanged between two invocations and changed only the shared memory requirements, so that 3 blocks could run concurrently in the first instance and only 1 block in the second. In this case, the instruction count changed significantly between the two invocations:
Instance 1: occupancy=[ 1.000 ] branch=[ 131072 ] instructions=[ 331257 ] // max concurrency = 3 blocks
Instance 2: occupancy=[ 0.333 ] branch=[ 131072 ] instructions=[ 341866 ] // max concurrency = 1 block

You can see here that the instruction count increased by 10,609 (about 3.2%) when the occupancy dropped from 1.000 to 0.333, while the branch count stayed the same. So how does occupancy relate to the instruction count?
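For anyone wanting to check the occupancy figures, here is a small Python sketch of how they follow from the number of concurrent blocks. The post above does not state the block size or the device limit, so the values below are assumptions: 256 threads per block and a maximum of 24 resident warps per multiprocessor, chosen because they reproduce the reported 1.000 and 0.333. The function name `occupancy` is also just for illustration.

```python
WARP_SIZE = 32  # threads per warp


def occupancy(blocks_per_sm: int, threads_per_block: int, max_warps_per_sm: int) -> float:
    """Fraction of the multiprocessor's warp slots that are occupied."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    active_warps = min(blocks_per_sm * warps_per_block, max_warps_per_sm)
    return active_warps / max_warps_per_sm


# ASSUMPTIONS (not stated in the post): block size and per-multiprocessor warp limit.
THREADS_PER_BLOCK = 256
MAX_WARPS_PER_SM = 24

# (concurrent blocks, instruction count) taken from the two profiler lines above.
runs = [(3, 331257), (1, 341866)]

for blocks, instructions in runs:
    occ = occupancy(blocks, THREADS_PER_BLOCK, MAX_WARPS_PER_SM)
    print(f"{blocks} block(s): occupancy={occ:.3f} instructions={instructions}")

# Relative increase in the recorded count when going from 3 blocks to 1 block.
increase = (runs[1][1] - runs[0][1]) / runs[0][1]
print(f"instruction count increase: {increase:.2%}")
```

Under those assumptions the occupancy numbers are fully explained by the block count, so the open question is only why the instruction count moves with it.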

Can someone with insight into the workings of the profiler please answer the above questions? Please excuse the lengthy post; I thought more details would help answer the question better.

Thanks for your time,

Bumping up, hoping someone will reply :)