"Instruction Fetch" in Nsight Performance Analysis

I am using Nsight to analyze the performance of my kernel.
“Instruction Fetch” is the largest contributor, at 29% of Issue Stall Reasons.
I tried to find out which factors cause Instruction Fetch stalls, but it is difficult to find documentation, even in the CUDA docs.
Does it come from branching, like “if…else…”? I am not sure.

Could you give me some suggestions?

Thank you very much!

It’s a form of latency.

Branching could cause it.

Also, if you have loops that are long enough not to fit in the instruction cache, then that could result in instruction fetch issues.

try slide 62 here:


Thank you very much, txbob.

You are right. I have 6 loops in each thread.
If I understand what you said correctly, I need to reduce the loops?

Note that txbob specifically referred to loops whose loop body does not fit into the instruction cache. The instruction cache on GPUs is quite small, I think 512 instructions is a typical size.

GPUs do not have branch prediction and fetch instructions in simple increasing address order. Any branch (loop or not) that causes instruction fetch to be restarted at a non-sequential location can cause a stall. A much lengthier stall will occur if the branch target location is not in the instruction cache.

If loops are the major source of branches in your code, can you get rid of some of them, perhaps through increased parallelization? It is not really possible to give specific advice without knowing more about the actual structure of your code.

Thanks a lot, njuffa.

Is 512 instructions a fixed size? I mean, if I disassemble my code to check the number of instructions, then regardless of the number of loops, is it OK as long as the instruction count is less than 512?

I don’t happen to know what the size of the instruction cache (I$) is. AFAIK it’s not published, although microbenchmarking may be able to discover it, and there may be material that you can google for that would suggest the size for a particular GPU architecture.

However, the I$ is an SM-wide resource. This means that all warps in a given SM that need instructions will attempt to hit in the I$ (this is really a function of the instruction dispatch unit in the SM, not the warp itself).

Having said that, it is still probably a reasonable starting point to ask whether there are more (SASS) instructions in the kernel than can fit in the I$. However, in the event of concurrent kernels (and possibly other scenarios) the I$ may be shared by more than one instruction stream.

Thanks, txbob!
Where are instructions cached, in L1?
Is it separate from the data cache?
It would be better if the size were adjustable.

I have disassembled my kernel. The instruction count is more than 512.
I think I need to reduce the size and try that first.

AFAIK, unpublished (google?)

AFAIK, unpublished (google?)

I’m pretty sure the size is not adjustable.

A point you may have missed is that this is a form of latency.

The GPU is intended to be a latency-hiding machine. The GPU hides latency by having lots of available work to switch to.

So any kernel may initially be instruction-bound, until the machine gets enough work in the hopper that it can begin to hide latency using the standard latency-hiding mechanism.

If you have a very short kernel, or a kernel that launches only a small number of threads, or otherwise doesn’t expose much parallelism, then the machine may have difficulty hiding latency, and the result may be a form of efficiency loss (e.g., being instruction bound).

There may be some cases where you have written code that the profiler suggests is “instruction bound”, but the correct response is not to attack the instruction-boundedness, but to expose more parallelism.

It’s impossible to say, since you haven’t shown any code. Even if you had, I’m not saying I would do the analysis for you. The simple fact that you have more than 512 SASS instructions in your kernel does not necessarily mean that things will improve when you drop below 512.

Let’s take a very simple example. Suppose I have launched threadblocks of 32 threads. That means, with respect to that threadblock, each instruction in the instruction stream will be executed by a single warp. Now suppose I launch the same code with threadblocks of 256 threads. That means that each instruction in the instruction stream is effectively re-used 8 times as it is processed by each of the 8 warps in the threadblock. This would result in an 8x reduction in instruction fetch pressure.

This doesn’t necessarily tell the whole story, as it does not account for multiple threadblocks being resident on an SM, for example. But this is the sort of code characteristic that I am referring to. For such a code, I would not look to squeeze my kernel code down to under 512 instructions. I would look for ways to expose more parallelism so that I can launch as many threads as the SM will hold.

Thank you very much, txbob.

Do you have any comments about the write performance of surfaces?