Why does a warp consist of 32 threads? Why is a warp not, say, 16 or 64 threads? What's the hardware reason?

Hi!

I’m doing a project where I have to study quite a bit of CUDA, and my supervisor asked me why a warp is 32 threads, and not say 16 or 64. So I have to find out why. I’ve been trying to find the reason why it is 32 on and off for a few days. So far, I’ve come to the conclusion that:

As the clock frequency of the 32-bit FPUs is twice that of the instruction unit, the FPUs can perform two identical operations in series before the instruction unit has a new, different operation. 8 FPUs * 2 = 16, so a warp should be 16 threads… This is of course not the case.

Can anyone help me to understand why the warp consists of 32 threads? Where is my missing factor of 2? Or am I completely off target?

Thank you,

Ian

You’re completely right about the first part of the puzzle, but there’s another piece.

The MP actually consists of 8 SPs and 2 transcendental units. The SPs compute basic arithmetic - addition, multiplication, binary ops. The transcendental units handle complex operations, like reciprocal, reciprocal square root, logs, exponentiation, and trig functions. This means that the instruction pipeline is dual issue. First, the SPs get an instruction, then the transcendental units get an instruction.

As a result, it’s 4 clock cycles for each instruction cycle. First the instruction for the SPs is issued (2 cycles - remember that the instruction clock is only half that of the ALU clock), then the instruction for the transcendental units is issued (2 more cycles).

That gives 8 SPs * 4 cycles = 32 threads in a warp.

Note that the transcendental units don’t have threads of their own - they just service special instructions for the entire MP.
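
If you ever need the number in code, by the way, it's cleaner to query it at runtime than to hardcode 32. A minimal sketch using the standard CUDA runtime API (device 0 picked just for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // warpSize is reported by the runtime, so code written against it keeps
    // working even if NVIDIA changes the value on future hardware.
    cudaGetDeviceProperties(&prop, 0);
    printf("warp size:       %d threads\n", prop.warpSize);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);
    return 0;
}
```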

Hello and thank you for your quick answer!

Ok, that seems to make a lot of sense (even if I have to read up on instruction pipelines a bit. Wikipedia, here we go!).

Your answer, however, gives rise to new questions! If I’ve understood pipelining correctly, it is usually used in, say, a Core 2 Duo to feed both cores with instructions so that they can both work simultaneously (if there are no conflicts or dependencies). Does this mean that the 2 transcendental units can work in parallel (don’t we all love that word) with the 8 FPUs, in effect giving a throughput of 8 + 2 = 10 operations per clock cycle?

By the way, you wouldn’t happen to have a source, or know where I could find one?

I greatly appreciated your answer! Thank you very much!

Ian

On a lighter note,
This is just the usual ploy of supervisors to show off that they can ask questions that are difficult to answer… lol…

btw,
There are 16 smem banks, each capable of delivering data once every 2 clock cycles… So 16 * 2 = 32.
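
Roughly the way I picture it (just my understanding, not an official formula, 4-byte words assumed):

```cpp
#include <cstdio>

// Sketch: with 16 shared-memory banks and 4-byte words, successive 32-bit
// words land in successive banks, wrapping around every 16 words.
int bankOfWord(int wordIndex) { return wordIndex % 16; }

int main() {
    // A half-warp of 16 threads reading shared[threadIdx.x] hits 16 distinct
    // banks, so it is serviced without conflicts; the other half-warp takes
    // the next 2 cycles -- which is the 16 * 2 = 32 above.
    for (int tid = 0; tid < 16; ++tid)
        printf("thread %2d -> bank %2d\n", tid, bankOfWord(tid));
    return 0;
}
```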

If your supervisor keeps on pressing…, you could say “The warp size cannot be an imaginary number. It has to be a positive integer. NVIDIA chose 32 as per their convenience”… :)

Ok, thanks. The supervisor doesn’t know much about CUDA, so I think he’s asking mostly due to a lack of insight into CUDA, not to show off. But thanks for the tip anyway :).

I like your theory for explaining where the second 2 comes from; didn’t think of that earlier. Thanks for your input!

Right… I believe the hardware hides some other latency during the time smem tries to fetch the data… – just a guess.

Ahh, got it… Your supervisor frequents this forum, doesn’t he?? :)

Good luck,

Bye bye

Lol, no, I don’t think he’s here.

Anyway, NVIDIA isn’t that clear about why things work the way they do, if they even explain how they work. At least not once you start digging.

Ian

FWIW, I think that at NVISION 08, it was let slip that the real hardware scheduling unit is 16 threads (i.e. a half-warp). This is why all the memory coalescing rules are based on half-warps. However, since it’s entirely possible that the hardware scheduler will deal with bigger chunks in future, CUDA is made to deal with warps of 32 threads. It’s a simple means of future-proofing.
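
That’s also why the classic advice is to have the 16 threads of a half-warp touch 16 consecutive, aligned words. A generic copy kernel, purely to illustrate the access pattern (not claiming anything about the exact coalescing rules of any particular compute capability):

```cpp
#include <cuda_runtime.h>

// Illustration only: thread i touches element i, so each half-warp's 16
// accesses fall in one contiguous, aligned segment and can be coalesced
// into a single memory transaction.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```

Launched as something like copyCoalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n), i.e. with a block size that is a multiple of the warp size.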

I guess a 32-bit device is likely to have 32 threads in a warp, because you can put various per-thread bit masks into a single internal register.
In the video for a future CUDA debugger (NVIDIA Nsight Visual Studio Edition | NVIDIA Developer) you can see a 32-bit ActiveMask, which seems to have a bit for every running thread in a warp.
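
You can also get at that one-bit-per-lane mask from device code: the warp vote intrinsics return a 32-bit word with one bit per thread of the warp, which only works out because a warp is 32 threads wide. A small sketch (the kernel and buffer names are mine; __ballot_sync is the CUDA 9+ spelling of the older __ballot):

```cpp
__global__ void showWarpMask(unsigned int* masks) {
    // Every active lane contributes one bit of the result, so the whole
    // warp's predicate fits in a single 32-bit register -- just like the
    // ActiveMask shown in the debugger.
    unsigned int mask = __ballot_sync(0xFFFFFFFFu, threadIdx.x % 2 == 0);

    // Lane 0 of each warp writes the mask out (blockDim.x assumed to be a
    // multiple of 32 for this sketch).
    if (threadIdx.x % 32 == 0)
        masks[blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32] = mask;
}
```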

The actual size of a warp is a tradeoff between two competing factors:

  • The bigger a warp is, the easier it is to manage and schedule a large number of threads. A big warp also makes it easier to keep the deep pipelines full on the FPUs. This is one reason that NVIDIA may expand the size of a warp in the future. If multiprocessors go from 8 to 16 FPUs, then you will need to double the warp size to keep the pipelines full with the same low probability of pipeline hazards as you have now.

  • The smaller a warp is, the easier it is to efficiently implement some algorithms. Your block size needs to be a multiple of the warp size, or you will have bubbles in the pipeline. Additionally, a large warp increases the penalty for branching in some cases. Branches in which an entire warp takes the same path have no scheduling penalty, whereas branches that diverge across the warp require splitting it (or masking it and running it twice, or however they handle it in hardware); see the small sketch at the end of this post. Algorithms in which the amount of shared memory required scales with the size of the block might also prefer a smaller warp, since that would make smaller blocks possible. (Remember, a block smaller than a warp means you have empty pipeline slots.)

So efficiency favors large warps, and flexibility favors small warps. Someone at NVIDIA must have analyzed these two factors and decided that 32 (or perhaps 16 in current hardware, but 32 in the future) was an acceptable tradeoff for the kind of workloads they imagined.
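
To make the divergence point above concrete, here is a minimal sketch (assuming the usual 32-thread warp; illustrative kernels, not benchmarks):

```cpp
// Diverges inside every warp: odd and even lanes take different paths,
// so each warp effectively runs both branches with complementary masks.
__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] += 1.0f;
}

// Branches on the warp index instead: all threads of a given warp take the
// same path, so there is no divergence penalty.
__global__ void warpAligned(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) data[i] *= 2.0f;
    else                   data[i] += 1.0f;
}
```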


Hi all!

Thanks for your replies! It does seem like no one is 100% certain about why a warp is 32, or if you are, then there are some contradictions between you :).

I’ll keep on trying to get an answer, and if I do, I’ll post it here. I might include some of your theories in the report and point out that CUDA is a bit of a black box at times, which of course has the benefit of enforcing a high level of abstraction.

Cheers!

Ian

As has already been said: it simply is a tradeoff between different factors. They could have made it 16 or 64. In fact, for some GPUs from AMD/ATI, the equivalent of what NVIDIA calls the ‘warp size’ is 64.

As far as I have heard:

current hardware could have a warp size of 16, but for forward compatibility they chose 32, so they can still double the number of ALUs per SM without people needing to rewrite/recompile their software.

So from the postings above, I can conclude that the warp size must be a positive integer and a power of 2.