Implications of Warp Size for Future GPUs

In a previous forum entry, someone mentioned that the warp size, if it were to change, would most likely get larger, not smaller. I can see why it might be better if the warp size got smaller instead.

I’ll put my 2 cents in. If anyone knows more about the importance of the warp size, could you please add a few more cents :)

[U]Importance of Warp Size (Smaller or Larger than the Current Size of 32)[/U]

  • A smaller warp size may be advantageous because fewer threads are affected by the divergence caused by flow control instructions: finer granularity for programs that need flow control. Taken to that extreme, a warp size of 1 would seem ideal.

I just thought of this. I apologize if I’m putting down on paper something that is very obvious to most users. I just started learning about this CUDA stuff.

It seems to me that it would be very advantageous with future GPUs if there were a runtime function that allowed the programmer to control the warp size used by a kernel. Let me explain:

If I have a kernel with few flow control instructions, then it may be advantageous for the warp size to be 32, or some other relatively large value. On the other hand, if I have a kernel that needs many flow control instructions, then divergence becomes an issue that hurts performance. For such a case, a warp size of 1 may make more sense.
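To make the divergence point concrete, here is a hedged sketch in CUDA; the kernel names and branch conditions are my own invention, not from this thread. When threads in the same 32-thread warp take different sides of a branch, the hardware executes both paths serially with the inactive threads masked off:

```cuda
// Hypothetical kernel illustrating warp divergence. Odd and even threads
// sit in the same 32-thread warp, so the warp executes BOTH branches,
// masking off the inactive threads each time.
__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = 1.0f;   // half the warp idles here...
    else
        out[tid] = 2.0f;   // ...and the other half idles here
}

// By contrast, branching on a warp-aligned quantity keeps each warp
// uniform, so no serialization occurs.
__global__ void uniform(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / 32) % 2 == 0)   // same branch taken by a whole warp
        out[tid] = 1.0f;
    else
        out[tid] = 2.0f;
}
```

The second kernel shows the usual workaround on current hardware: branch on something that is uniform across a warp, so no warp ever diverges.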

NVIDIA: Since GPUs are programmable beasts, do you see any benefit in an architecture for future GPUs that allows the programmer to select the warp size, or other such performance-affecting parameters, on a per-kernel basis?

The warp size, I think, is dictated by the hardware. Each multiprocessor only has one instruction decoder, so each of the 8 (on current cards) stream processors needs to run the same instruction. That means the minimum warp size possible is 8. The stream processors are also pipelined, so for maximum efficiency you are going to need 2 instructions in flight to keep the pipeline stages busy. The easiest way to do that is to run the same instruction you already decoded, but for another set of threads, which doubles the warp size up to 16. There is also a clock rate difference between the instruction decoder and the stream processors, so perhaps you will need some extra time to decode the next instruction, so doubling the warp size again to 32 seems plausible. (I am fuzzy on the last step. Someone who really knows the Nvidia hardware would have to comment if I got the details right.)

So, I think the size of the warp on current hardware could not be reduced without causing parts of the chip to be underused. This is probably a general result, in fact. The best warp size for any chip will be the smallest one that can keep the chip busy 100% of the time. A bigger one will have no benefits (and possibly underutilize the instruction decoder), and a smaller one will underutilize the stream processors as they wait for new instructions.
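The runtime does report the warp size as a read-only device property, which matches the view that it is fixed by the hardware rather than selectable. A minimal host-side sketch (assuming a single CUDA-capable device 0):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // warpSize is a fixed hardware property; on current NVIDIA GPUs it is 32.
    printf("warp size:        %d\n", prop.warpSize);
    printf("multiprocessors:  %d\n", prop.multiProcessorCount);
    return 0;
}
```

Device code can also read the built-in variable `warpSize` directly; on current hardware both report 32.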

For future products, I guess it is a design tradeoff: If you spend transistors on more instruction decoders, then the warp size can be reduced, making divergent kernels faster. But if you spend the transistors on more stream processors per multiprocessor, then you will finish highly parallel jobs faster. Given that Nvidia’s main business is selling cards for graphics, I bet the deciding point on that will be the best balance for 3d rendering. :)


Thanks for the excellent information on the basic architecture of GPUs.

For C870 boards, whose architecture does not necessarily have to play nice with graphics processing, I suppose the GPU could be designed just for CUDA (it probably already is).

In addition, if a future C870 had dual GPUs on board, one possible architecture would be for one GPU on the card to be designed for kernels with very little need for flow control, while the other is designed to perform better for kernels with lots of flow control. That is, two GPUs to cover more “types” of kernels.

Here is another idea from a beginner and naive CUDA student: design future C870 boards with a way to add daughter cards. That is, the user could keep pluggable GPU modules around and attach the appropriate ones to the C870 base board to configure the hardware for the “types” of kernels an application needs.

I think you are missing some very important points:

  • GPU development is driven by the needs of 3D games
  • There are a lot of people buying these 3D cards
  • The price of developing a GPU is very high

So you can see that it is a very bad business model to design GPUs especially for a niche market. And even if they did, the price would be so high that the biggest benefit of CUDA (lots of processing power for a relatively low price) would be gone.

I personally think the biggest revolution will be that there will be a LOT of CUDA-capable cards out there. Software companies that need more processing power can start using it now, because the audience that can run their code will be a big group of people rather than a small one.

One other benefit of a larger warp size is that it would likely come with the ability to have a larger block size, which means more threads could communicate through shared memory.
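As an illustration of that point, here is a hedged sketch of block-level communication through shared memory (the kernel and its fixed block size of 256 are my own example). Threads can only cooperate this way within one block, so the maximum block size caps how many threads can share data:

```cuda
// Hypothetical block-level sum: every thread in a 256-thread block
// contributes one value, and the threads communicate through shared memory.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];            // one slot per thread in the block
    int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // every thread sees every value

    // Tree reduction within the block; a larger maximum block size would
    // let more threads take part in this kind of exchange.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```

Launched as `blockSum<<<numBlocks, 256>>>(in, out);`, exactly one block-sized group of threads cooperates per output element.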

If you need a massively parallel processor with a ‘warp size’ of one, GPU programming might not be the best fit for your application in the first place. You could look at something like the Niagara chips from Sun, which are more aimed at doing a lot of slightly different jobs at once (like web serving).

For a lot of things the large warp size is an advantage, not a disadvantage. It means the GPU can execute instructions really fast without spending too much time in instruction decoding.