AFAIK, unpublished (google?)
I’m pretty sure the size is not adjustable.
A point you may have missed is that this is a form of latency.
The GPU is intended to be a latency-hiding machine. The GPU hides latency by having lots of available work to switch to.
So any kernel may initially appear instruction-bound until the machine has enough work in the hopper that it can begin to hide latency, using its standard latency-hiding mechanism.
If you have a very short kernel, or a kernel that launches only a small number of threads, or otherwise doesn’t expose much parallelism, then the machine may have difficulty hiding latency, and the result may be a form of efficiency loss (e.g. appearing instruction-bound).
There may be some cases where you have written some code that the profiler suggests is “instruction bound”, but the correct response is not to attack the instruction-boundedness directly, but rather to expose more parallelism.
It’s impossible to say since you haven’t shown any code. Even if you had, I’m not saying I will do the analysis for you. The simple fact that you have more than 512 SASS instructions in your kernel does not necessarily mean that things will improve when you drop below 512.
Let’s take a very simple example. Suppose I have launched threadblocks of 32 threads. That means, with respect to that threadblock, each instruction in the instruction stream will be executed by a single warp. Now suppose I launch the same code with threadblocks of 256 threads. That means that each instruction in the instruction stream is effectively re-used 8 times as it is processed by each of the 8 warps in the threadblock. This would result in an 8x reduction in instruction fetch pressure.
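To make that concrete, here is a minimal sketch (kernel name, sizes, and the device pointer `d_data` are all hypothetical, chosen only for illustration) of the same kernel launched two ways:

```cuda
// Hypothetical kernel: each thread scales one element.
__global__ void scale(float *data, float a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= a;
}

// 32 threads per block: 1 warp per block, so with respect to each
// threadblock, every instruction in the stream serves a single warp.
scale<<<n / 32, 32>>>(d_data, 2.0f, n);

// 256 threads per block: 8 warps per block, so each instruction is
// effectively reused 8 times as the 8 warps work through the stream,
// roughly an 8x reduction in instruction fetch pressure.
scale<<<n / 256, 256>>>(d_data, 2.0f, n);
```

The launches assume `n` is an exact multiple of the block size; real code would round the grid size up and rely on the `idx < n` guard.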
This doesn’t necessarily tell the whole story, as it does not account for multiple threadblocks being resident on an SM, for example. But this is the sort of code characteristic that I am referring to. For such a code, I would not look to squeeze my kernel code down to under 512 instructions. I would look for ways to expose more parallelism so that I can launch as many threads as the SM will hold.
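One common idiom for exposing parallelism this way is a grid-stride loop, which lets you size the grid to fill the machine rather than to the problem size. A sketch, again with hypothetical names (only `cudaDeviceGetAttribute` and `cudaDevAttrMultiProcessorCount` are real CUDA runtime API entities):

```cuda
// Grid-stride loop: any grid size produces a correct result, so the
// grid can be sized for occupancy instead of matching n exactly.
__global__ void scale(float *data, float a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)   // stride = total threads in grid
        data[i] *= a;
}

// Launch enough blocks to keep every SM busy, independent of n.
int numSMs;
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
scale<<<32 * numSMs, 256>>>(d_data, 2.0f, n);
```

The multiplier (32 blocks per SM here) is just an illustrative choice; the point is that the launch aims to hold as many threads as the SMs will carry.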