How does reducing unrolling or branching code actually reduce instruction fetch?

Hey experts!

I was just wondering how exactly reducing manually unrolling code or reducing branch divergence actually reduces instruction fetch? Inside an if statement does it need to fetch all of the instructions nested inside? And therefore, more instructions results in higher instruction fetch? How does the GPU choose when to fetch instructions?


At least in the past (don’t know about the Pascal architecture), GPUs had no branch prediction whatsoever, which simplifies the hardware design. The GPU fetches instructions in simple order of increasing addresses. Any deviation from this via control transfers instructions therefore introduces a pipeline bubble.

Under normal circumstances these pipeline bubbles are well covered by the GPU’s zero-overhead context switching, but the effect can become noticeable (to the tune of 2-3% typically) when the control transfer also results in an instruction cache miss, e.g. a loop-closing branch for a loop body that doesn’t fit into the ICache.

If we compare “branchy” code with equivalent “non-branchy” code, we will typically find that the dynamic instruction count for the “non-branchy” code is smaller, since the latter saves the control transfer instructions as well as pushes and pops from the convergence point stack (e.g. SSY instructions visible in SASS). A smaller dynamic instruction count means fewer instruction fetches.

Loop unrolling is one technique that reduces branches (and potentially other instructions, such as comparisons that control the branch) in the dynamic instruction stream. It doesn’t matter whether this happens due to the compiler unrolling the loop during optimization or the programmer doing it. Given the current state of the CUDA compiler, manual unrolling should rarely be necessary.

What is the context in which this question arose? Instruction fetch is not typically something one has to worry about when optimizing code for GPUs.

Oh I have just been reading about GPU Architecture & Scheduling and wanted a more in depth understanding.

And sorry would you be able to clarify a few things:

  1. What you mean by “control transfer instructions”?
  2. What do you mean by “Dynamic instruction count”?
  3. How big is the ICache and where is it located? I’m assuming the instructions are fetched from global memory?

Thanks :)

Consider consulting a book about computer architecture (e.g. Patterson/Hennessy) for general architecture concepts.

  1. examples for control-transfer instructions: unconditional branch, conditional branch, call, return

  2. static instruction count refers to the count of instructions present in the executable, dynamic instruction count refers to the number of instructions encountered at executable run time, i.e. that are {fetched, issued, retired}, where #fetched >= #issued >= #retired

  3. the GPU has instruction cache in each cluster of execution units, known as SM. The size is not documented and may differ by architecture, but could be reverse engineered by clever experiments. Generally you can assume around 2KB to 4KB. Executable code is stored in global memory, correct.

njuffa’s covered your questions really well. There’s a very dated but extremely well done paper (with code) that goes into the details of backengineering CUDA hardware low level architecture configuration. The paper’s results are four generations old, but the basic principles of its analysis techniques are still valid today. It found the Tesla generation GPU (from 2008!) had a 4kb, 3 way associative instruction cache with a 128 byte line size.

Njuffa thanks! That makes much more sense - will check out that book :)
SPWorley thanks, that’s a really helpful paper! Good find!

In this pdf it states that you can reduce instruction fetch by reducing unrolling however I thought you said unrolling reduces instruction fetch since it reduces the number of control transfer instructions. How is this so?

I am not sure where you see this, please be more specific. I only see this:

Reduce branching to avoid I$ misses
Reduce unrolling and function inlining

Loop unrolling and function inlining reduce the number of control transfer instructions, thus dynamic instruction count, and instruction fetches. They can fairly easily increase the number of instruction cache misses due to the small size of the instruction cache and the large size of instructions (8 bytes each, plus overhead from op-steering words in recent GPU architectures).

Instruction cache miss is of course a stall reason triggered by instruction fetch, that is why it is listed on that slide.

Ohhh I think I get it - there’s basically a trade off for Loop unrolling. Loop unrolling is effective by potentially reducing the dynamic instruction count (from the control-transfer instructions) and therefore remaining in the I$, however if you unroll with a huge body, the dynamic instruction count can potentially increase so that its bigger than the I$ and therefore result in higher instruction fetch due to increased I$ misses (since they aren’t in the cache)?

Thanks :)

Correct, there are basically trade-offs in loop unrolling, so excessive unrolling is not helpful. The CUDA compiler uses heuristics to determine how much a particular loop should be unrolled, typically it uses a factor of four for partially unrolled loops. As with any heuristic, the compiler may occasionally make a bad choice.

Instructions are fetched through the memory hierarchy. If they are not in the instruction cache, they must be pulled from global memory, which is very expensive on GPUs. To give a rough idea, a pipeline bubble introduced by a deviation from sequential flow due to CTIs (control transfer instructions), such as a loop-closing backward branch could be on the order of 20 cycles (provided instructions continue to be fetched from the instruction cache), while a pipeline bubble introduced by an instruction cache miss could be on the order to 600 cycles.

There are actually at least two unrolling optimizations stages in the CUDA compiler: one in the NVVM component (derived from LLVM), the other inside PTXAS (which, despite what the name might suggest, is not a simple assembler, but an optimizing compiler). In the past I observed that sometimes they were at conflict with one another, leading to suboptimal results. But more recent observations of generated code in CUDA 7.5 lead me to believe that this is no longer the case.

hmmm interesting!

The paper that SPWorley posted indicated L1, L2 & L3 instruction caching on their gpus, therefore is it safe to assume that when there is an I$ miss it then checks each cache layer before resorting to global memory?

Also, do you know if they perform read-ahead so when they fetch an instruction from global memory they then load in the next x number of instructions?

It is news to me that GPUs employ a multi-level cache hierarchy as far as instruction caching is concerned. Should that in fact exist (only NVIDIA knows for sure), yes, obviously each level would be consulted in turn before resorting to fetching from global memory. I have my doubts that there really are three levels of instruction caching, it seems needlessly complex for a throughput architecture with high latency tolerance. I wonder whether the authors of that paper might have misinterpreted the data from their microbenchmarks.

NVIDIA keeps details of the GPU architecture a closely guarded secret, although some information may be disclosed via patent applications, while other can be reverse engineered through clever experiments. I have not seen anything published on the details of instruction fetch. A traditional experiment to determine the depth of a prefetch queue is to use self-modifying code that modifies the instruction stream N bytes ahead of the modifying store instruction. Not so easy to set up with a GPU, I haven’t tried it. However, instruction prefetch is a very old technique employed even by unsophisticated CPUs, so it stands to reason that the GPU does use it.

Hmm yeah maybe.

So just to clarify how that experiment works: If the prefetch queue depth was 4 instructions, I would execute the 1st instruction then straight after modify either the 2nd, 3rd or 4th instruction. And for discussion sake lets say I modified the 4th instruction, when I go to execute it, if it executes the old instruction then that instruction was within the queue depth. However if I modified the 5th instruction instead and the queue depth was 4, then it should execute the new instruction?

Exactly. Since the prefetch queue doesn’t have a coherency mechanism, you would check (via a modified instruction) from what forward address on a recently modified instruction can be successfully picked up.

E.g. use a store instruction that writes out a new instruction (e.g. “ADD reg, 1”), followed by a sequence of instructions that do nothing, e.g. “ADD reg, 0”. Initialize ‘reg’ to 0 before this entire code sequence, and at the end check whether ‘reg’ is 0 or 1.

In modern operating environments the first obstacle to performing such an experiment is usually that code space is not writable. Obviously there are ways around this, since this is what many computer viruses do.

That’s such a cool experiment. Is it possible to write to the address space on the GPU using inline PTX or would you need to write it in SASS?

Thanks for the help :)

it’s probably impossible due to memory protection. actually, GPU runs sort of its own OS, and you cannot get superuser pivileges there without modifying the GPU drivers

Ah I see. Thanks anyway :)