GPU architecture and CUDA kernel execution


I am afraid I don’t quite understand the basic operation of a GPU when executing CUDA kernels as described in the Programming guide. Chapter 3 nicely describes GPU implementation but in my opinion there are some contradictory states in it.
It says, for example, that:

  • a block of threads is executed on a single multiprocessor (MP)
  • a MP consists of 8 scalar streaming processors (SPs)
  • each thread is executed by one SP
  • the threads are executed in groups of 32, called warps
  • a MP can execute up to 8 blocks concurrently
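As a side note, the hierarchy in the bullet points above can be expressed with simple index arithmetic. Here is a minimal C sketch; the 32-threads-per-warp and 8-SPs-per-MP figures come from the list itself, while the lane-to-SP mapping is an illustrative assumption for the G80-style model discussed here, not documented behaviour:

```c
#include <assert.h>

/* Figures from the bullet list: 32 threads per warp, 8 SPs per MP.
   The lane -> SP mapping is an illustrative assumption. */
enum { WARP_SIZE = 32, SPS_PER_MP = 8 };

/* Which warp (within its block) a thread belongs to. */
int warp_id(int tid) { return tid / WARP_SIZE; }

/* Position of the thread inside its warp. */
int lane_id(int tid) { return tid % WARP_SIZE; }

/* Under the 8-SP model a warp is issued in 4 groups of 8 threads:
   lane % 8 picks the SP, lane / 8 picks the issue group. */
int sp_index(int tid)    { return lane_id(tid) % SPS_PER_MP; }
int issue_group(int tid) { return lane_id(tid) / SPS_PER_MP; }
```

For example, thread 35 is lane 3 of warp 1, and under this model would be issued on SP 3 in the warp's first group of 8.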

The guide says:
Every instruction issue time, the SIMT unit selects a warp that is ready to execute
and issues the next instruction to the active threads of the warp. A warp executes
one common instruction at a time, so full efficiency is realized when all 32 threads
of a warp agree on their execution path.

From what I understand, only 8 threads of the same block can actually run concurrently in the strict sense of the word, and even then only if they don't diverge (because a block is not divided among MPs). So the sentence above makes no sense to me. If my GPU consists of 12 MPs, then 12*8 = 96 threads actually run concurrently, but in the case of total divergence the number of concurrent threads drops to 12.
The meaning of warps is completely vague to me, because a warp is not the unit of concurrent execution but is used only for scheduling.

I hope someone can enlighten me about the true state of affairs. I would recommend adding some sort of pseudocode of the thread/warp/block execution control logic to the guide, because it is the only real CUDA literature available (besides the reference and the tutorials, which do not cover the information discussed here).

Thank you in advance.

The scalar streaming processors have a pipeline of length 4, so each is loaded with four threads at a time. Those four threads are in different, overlapping stages of execution at any moment. So it depends on your definition of concurrent execution: on your GPU there would physically be 384 threads being worked on at any time, though only 96 results will come out per clock cycle.

If there’s total divergence (each and every thread follows a different execution path) then you’re right: there will be only 12 threads active, AFAIK. Not only will you lose warping but probably pipelining as well.
But hey, this is supposed to be SIMD :)

As for “a MP can execute up to 8 blocks concurrently”, that’s not entirely true. The MP can have 8 blocks scheduled on it. It only runs a single warp at a time, but it can pick that warp from any of those 8 blocks. That’s also a solution for load balancing, by the way.
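A toy model of that policy might look like the following C sketch. The "pick the first ready warp, from whichever block" rule with a round-robin start point is my own assumption for illustration; the real hardware arbitration policy is not documented:

```c
#include <assert.h>

#define MAX_WARPS 64

/* Toy warp scheduler: each resident warp (from any of the up-to-8
   blocks scheduled on the MP) has a ready flag; each issue slot we
   pick one ready warp. The round-robin start point avoids starving
   later warps. */
typedef struct {
    int ready[MAX_WARPS]; /* 1 if the warp can issue this slot */
    int n;                /* number of resident warps */
    int next;             /* round-robin search start */
} Scheduler;

int pick_warp(Scheduler *s) {
    for (int i = 0; i < s->n; ++i) {
        int w = (s->next + i) % s->n;
        if (s->ready[w]) {
            s->next = (w + 1) % s->n;
            return w;
        }
    }
    return -1; /* every resident warp is stalled */
}
```

Note how load balancing falls out for free: a stalled warp (say, waiting on memory) is simply skipped, and a ready warp from any other block fills the slot.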

    • Are you familiar with pipeline execution?

I am familiar with pipelining (from a short course during my studies). However, I suppose that pipelining multiple threads is very different from what I know about pipelining traditional single-thread execution in CPUs, where the next instruction is fetched while the current one is being executed and the result of the previous one is being stored. My dilemma here is what you mean by overlapping states of execution: do the four pipelined threads execute different but consecutive instructions in the code, or the same instruction with different environments (i.e. register states)? I believe you meant the latter, because the specs clearly state that divergent threads are executed in sequence. I found no mention of pipelining, but perhaps I should look for another reference on scalar processor architecture.

To sum up: if I have total coherence of threads, 384 are executed concurrently, but the delay between the first one finishing and the last one finishing is 4 clock cycles. Right?

And the same speed is achieved if I have 4 packs of 96 coherent threads (within the same warps, of course)?

OK, total divergence was a hypothetical case, of course. But a marginal case like that explains a lot about the architecture, don’t you agree? A few examples would perhaps even say more than a 10 page chapter 3 in a guide.

Now, at least the last part is clear. :) It’s amazing that a GPU scheduler can use such elaborate load balancing techniques without incurring too much overhead.

Thank you anyway, my understanding is at least somewhat better now.

All threads in a warp execute the same instruction (on different data/registers). However, executing a single instruction requires a few steps: fetching operands from registers, doing the actual calculation, storing the result in a register, etc. These steps are pipelined and overlap. For example, the first 8 threads of a warp may, at some point, be storing their results while the next 8 (further back in the pipeline) are at the earlier stage of doing an ALU operation. I may be simplifying things, but I hope you get the idea.

If by “finishing” you mean reaching the same state of execution of an instruction, then yes, I think that’s the case.

Or even 12 packs of 32 threads.

In theory yes; in practice it’s nearly the same, because the MP must do some extra condition checking (obviously, no divergence is better than optimized divergence). But a rule of thumb for optimizing branching is to have consecutive warps of threads follow coherent paths. The following code should suffer little overhead due to branching:

if (threadIdx.x < 32) {
    // ... all 32 threads of warp 0 take this path together ...
} else {
    // ... all remaining warps take this path together ...
}

A warp contains 32 threads with consecutive IDs, so threads 0-31 go to one warp, threads 32-63 to another, etc. Warp-level branching is nearly as fast as no branching at all, but misaligned or finer-grained branching hurts your SIMD performance.
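The cost of a branch can be estimated by counting, per warp, how many distinct paths its 32 threads take, since each extra path is one more serialized pass. A small C model of the two-way-branch case (the pass-counting rule is the standard SIMT serialization argument, not measured hardware behaviour):

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Number of serialized passes a warp needs for a two-way branch:
   1 if all 32 lanes agree on the outcome, 2 if the warp is split. */
int passes_for_warp(int warp, int (*taken)(int tid)) {
    int t = 0, nt = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        int tid = warp * WARP_SIZE + lane;
        if (taken(tid)) t = 1; else nt = 1;
    }
    return t + nt;
}

/* Warp-aligned branch, like `if (threadIdx.x < 32)`: no warp splits. */
int lt32(int tid) { return tid < 32; }

/* Misaligned branch, like `if (threadIdx.x < 16)`: warp 0 splits. */
int lt16(int tid) { return tid < 16; }
```

With `lt32`, every warp needs a single pass; with `lt16`, warp 0 needs two passes while the other warps are still uniform.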

IIRC it is noted somewhere in the programming guide that the instruction decoder runs at only one quarter of the speed of the streaming processors. The way I understand it, the same instruction is therefore executed 4 times by each SP, but for 4 different threads: i, i+8, i+16, i+24. That’s not pipelining in the traditional sense, as that would require its own instruction for each stage of the pipeline.

The difficulty with CPU pipelining is that the instructions in a single thread tend to be dependent on the results from previous instructions, which leads to pipeline hazards. There are all sorts of tricks to fix this, including out-of-order execution, hyperthreading, etc. All of these things take up area on the chip.

Since CUDA bundles threads into groups of 32 (the warp) which must execute the same instruction at once, the hardware pipeline can exploit that. Every thread in the warp is guaranteed to be independent of the others, so there are no intra-warp hazards by construction. (There can still be read-after-write hazards between consecutive instructions of the same thread; the programming guide mentions that more than 192 threads are needed to hide these.) This approach to pipelining is similar in spirit to hyperthreading, though the restricted execution model of CUDA makes warp-pipelining much more efficient. As a nice bonus, without all the complexity of out-of-order execution and one instruction decoder per thread, more die area can be devoted to FPUs rather than control logic.

Do you know the exact pipeline stages of an SP? I mean just, say, IF, EX, MEM…

A year later and a few insights smarter, I’ll tell you I now believe it’s not a CPU-like pipeline after all. I’m not sure how it works.

Mmm… OK, what do you say about the following?

The SP has a pipeline of length 4. Presumably it consists of 4 stages: Fetch, Decode, Execute, Write-back. If I get it right, every SP executes four instruction streams (4 threads) on its pipeline (1[ ], 2[ ], 3[ ], 4[ ]) — for example th1(1.1, 1.2, 1.3); th2(2.1, 2.2, 2.3); th3(3.1, 3.2, 3.3); th4(4.1, 4.2, 4.3) — this way:

clock01: 1[1.1], 2[0.0], 3[0.0], 4[0.0]  (25%)
clock02: 1[2.1], 2[1.1], 3[0.0], 4[0.0]  (50%)
clock03: 1[3.1], 2[2.1], 3[1.1], 4[0.0]  (75%)
clock04: 1[4.1], 2[3.1], 3[2.1], 4[1.1] (100%) - 1.1 done
clock05: 1[1.2], 2[4.1], 3[3.1], 4[2.1] (100%) - 2.1 done
clock06: 1[2.2], 2[1.2], 3[4.1], 4[3.1] (100%) - 3.1 done
clock07: 1[3.2], 2[2.2], 3[1.2], 4[4.1] (100%) - 4.1 done
clock08: 1[4.2], 2[3.2], 3[2.2], 4[1.2] (100%) - 1.2 done
clock09: 1[1.3], 2[4.2], 3[3.2], 4[2.2] (100%) - 2.2 done
clock10: 1[2.3], 2[1.3], 3[4.2], 4[3.2] (100%) - 3.2 done
clock11: 1[3.3], 2[2.3], 3[1.3], 4[4.2] (100%) - 4.2 done
clock12: 1[4.3], 2[3.3], 3[2.3], 4[1.3] (100%) - 1.3 done
clock13: 1[0.0], 2[4.3], 3[3.3], 4[2.3]  (75%) - 2.3 done
clock14: 1[0.0], 2[0.0], 3[4.3], 4[3.3]  (50%) - 3.3 done
clock15: 1[0.0], 2[0.0], 3[0.0], 4[4.3]  (25%) - 4.3 done

So if we have fewer than 4 threads in the pipeline, we lose performance: with 3 threads we get 75%, with 2 — 50%, with 1 — 25%. This also answers why the warp length is 32: 8 SPs × 4 threads per SP pipeline = 32, loading every SP to 100%.

Also, if the scheme above is correct, it shows that the performance of the SP pipeline depends on the length of the threads: the fill and drain phases are amortized over more instructions the longer each thread runs.
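The table's totals follow from simple pipeline arithmetic: with t threads in flight and n instructions each on a depth-d pipeline, the last result appears after t·n + d − 1 cycles, and the steady-state utilization is t/d. A quick check in C against the table's own numbers (4 threads, 3 instructions each, depth 4):

```c
#include <assert.h>

/* Cycle on which the last result appears, for t threads with n
   instructions each on a depth-d pipeline, one issue per cycle
   (matching the clock table above). */
int total_cycles(int t, int n, int d) { return t * n + d - 1; }

/* Steady-state utilization in percent, for t <= d threads in flight. */
int utilization_pct(int t, int d) { return 100 * t / d; }
```

With t = 4, n = 3, d = 4 this gives 15 cycles, exactly the table's clock15, and the 75%/50%/25% figures for 3, 2, and 1 threads.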

What do you think?

SPs don’t fetch and decode instructions; they just execute them on their own vector lanes (just like the lanes of SSE units…).

The execution order you describe is correct, but pipelines are much deeper than 4 stages. Streaming Multiprocessors have around 30 pipeline stages (at “shader”/“fast” clock, which is twice the “slow” clock).
This is why it is recommended that you run at least 6 warps (or 192 threads) per SM to hide the read-after-write latencies, as Seibert tells you (Section 5.2 of the CUDA Programming Guide).
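The 6-warps/192-threads recommendation follows from issue arithmetic: if a register read-after-write latency is roughly 24 cycles and the SM issues to one warp every 4 cycles, then 24/4 = 6 warps are enough to keep it busy. A sketch of that calculation (the ~24-cycle latency is the programming guide's approximate figure, not an exact spec):

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Warps needed to hide a latency when the SM issues to one warp
   every `issue` cycles (ceiling division). */
int warps_to_hide(int latency, int issue) {
    return (latency + issue - 1) / issue;
}
```

Plugging in the guide's numbers: warps_to_hide(24, 4) = 6 warps, i.e. 6 × 32 = 192 threads per SM.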

Approximate timings are:

  • Fetch and decode: 4 slow clock cycles
  • Operand gather: 1 to 3 slow clock cycles
  • Execute: 10 to 20 fast clock cycles
  • Write-back: 1 slow clock cycle

The actual figures are not available, but this should be in the ballpark.

Fetch, decode, gather and write-back act on whole warps, so they have a throughput of 1 warp/slow clock.
Execute acts on groups of 8 threads, or quarter-warps (there are only 8 SPs per SM), so its throughput is 1 warp/4 fast clocks, or 1 warp/2 slow clocks.

The Fetch/decode/… stages have a higher throughput in order to be able to feed both the MAD and the SFU/MUL units alternately. Hence the peak rate of 8 MAD + 8 MUL per (fast) clock cycle.
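Those throughputs translate directly into the familiar peak-FLOP numbers. A back-of-the-envelope check in C, counting 8 MADs (2 flops each) plus 8 MULs (1 flop each) per fast clock per SM; the GTX 295 figures of a 1242 MHz shader clock and 30 SMs per GPU are assumed here for illustration:

```c
#include <assert.h>

/* Peak MFLOP/s per SM: 8 MAD/clock (2 flops each) + 8 MUL/clock
   (1 flop each), at the fast (shader) clock given in MHz. */
long peak_mflops_per_sm(long fast_mhz) {
    return (8 * 2 + 8) * fast_mhz;
}
```

At 1242 MHz that is about 29.8 GFLOP/s per SM, and with 30 SMs about 894 GFLOP/s per GPU, which matches the advertised single-GPU peak of the GTX 295.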

This is what confuses me. For example, a GTX 295’s shader clock is 1242 MHz, and half of that is 621 MHz. Yet the “core clock” is 576 MHz, which is not exactly half the shader clock. What am I missing? Is the “core clock” something other than the clock driving the MPs?

Yes. This puzzled me for a while, too. But the core clock is different from what I call the slow shader clock.

So there are (at least) 4 separate clock domains. In the GTX 295:

  • Memory clock at 1000 MHz: DRAMs and (part of) the memory controllers.
  • Core clock at 576 MHz: SM controllers, texture access units and caches, ROPs, command processor, graphics-related stuff and misc. logic.
  • Fast shader clock at 1242 MHz: arithmetic units.
  • Slow shader clock at 621 MHz: warp scheduler, registers, pipeline control… inside an SM.

(Instead of an actual fast clock generated by some PLL, the SPs might use both the rising and falling edges of the slow clock, although it makes no difference from a functional point of view.)

I found no official confirmation of this, but this is the most reasonable explanation I can think of.

To Sylvain Collange:

I think you mean something like this: SIMT == SIMD?

OK, if the 8 SPs are simply an abstraction over the SIMD architecture, then what do we have in reality? I mean, how many real cores/ALUs does an SM contain? What is the size of the SIMD register? What is the length of the SM’s pipeline, and what stages is it composed of, at least approximately?

I do not need the deep details; I want to understand why the SM performs the flow of instructions in groups of 32. Why does it require 4 clocks to execute 32 threads? And why is branching not possible between groups of 8 threads?

I am sure David Kanter is better at explaining these things than I am. ;) …8195242&p=6