Maximum Number of Warps and Warp Size per SM

Hi, I’d like to confirm my understanding of the maximum number of warps and the warp size per SM. Sorry for the naive question. For the Pascal architecture (compute capability 6.1), according to Table 15 in the CUDA C++ Programming Guide, each SM supports a maximum of 64 resident warps with 32 threads per warp, so the maximum number of resident threads per SM is 64 × 32 = 2048.
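(For reference, these limits can also be checked at runtime; a minimal sketch using cudaGetDeviceProperties, whose warpSize and maxThreadsPerMultiProcessor fields are standard members of cudaDeviceProp:)

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // max resident warps per SM = max resident threads per SM / warp size
    printf("warp size               : %d\n", prop.warpSize);
    printf("max threads per SM      : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max resident warps / SM : %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```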

From an architectural perspective, what hardware feature determines the limits on the maximum number of resident warps and the warp size per SM? For example, why can’t each SM have 128 resident warps of 16 threads instead, or 32 resident warps of 64 threads?

Thanks for your time.

A warp size of 32 threads has been a hardware constant for all NVIDIA GPUs from CC 1.0 to the present CC 9.0. While there is nothing to stop you from coding in such a way that only 16 threads per warp do useful work, you will be wasting 50% of the hardware, as the scheduler issues instructions at warp granularity, i.e. 32 threads at a time.
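To make the wasted-lanes point concrete, here is a minimal sketch (the kernel name halfwarp is hypothetical, purely for illustration): a block of 16 threads still occupies one full 32-lane warp, with half the lanes idle.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate lane utilization.
__global__ void halfwarp(float *out) {
    out[threadIdx.x] = 2.0f * threadIdx.x;
}

void launch(float *d_out) {
    // Each block has only 16 threads, but the hardware still allocates a full
    // 32-lane warp per block. The scheduler issues each instruction for the
    // whole warp with 16 lanes masked off, so half the warp's throughput is idle.
    halfwarp<<<1, 16>>>(d_out);
}
```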

As to why the maximum is 64 resident warps for CC 6.1, this is probably a hardware trade-off based on the amount of resources the scheduler needs to juggle warps. Once the currently executing warp stalls (waiting for a memory request, waiting to run on a particular functional unit that is under pressure, etc.), the scheduler parks that warp and runs the next one that is ready to run.
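You can see how many warps the SM can juggle for a particular kernel via the occupancy API: for a given kernel and block size, cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks (and therefore warps) can be resident on one SM at the same time. A minimal sketch (the kernel name dummy is hypothetical):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only so the occupancy calculator has something to inspect.
__global__ void dummy(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

int main() {
    const int blockSize = 128;   // 4 warps per block
    int blocksPerSM = 0;

    // How many blocks of this kernel can be resident on one SM at once?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummy, blockSize, 0);

    printf("resident blocks per SM: %d\n", blocksPerSM);
    printf("resident warps per SM : %d\n", blocksPerSM * blockSize / 32);
    return 0;
}
```

If the reported warp count is below the hardware maximum (64 for CC 6.1), the kernel's register or shared-memory usage is the limiting resource rather than the warp-slot limit itself.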

In processor design, flexibility leads to complexity. The guiding principle of GPU design is to minimize the complexity of handling control flow and, to a lesser extent, data access. This saves square millimeters on the die that can then be used for (1) more execution units and (2) a larger or smarter on-chip memory hierarchy, roughly in that order. For workloads that can benefit from massive parallelism, GPUs owe their performance advantage over CPUs to this focus on these two aspects.

Note that the die sizes of the highest-performing CPUs and GPUs are close to the limit of what is manufacturable (currently around 850 square millimeters; a Xeon Platinum 9200 die is ~700 mm², an H100 die is ~810 mm²), so design trade-offs have to be made. One cannot “have it all”. Since a larger die translates to higher cost, these trade-offs apply similarly to lower-cost, lower-performing variants at various price points.

This leads to divergent design philosophies. CPUs are optimized for low latency and for irregular control flow and data access patterns, with large on-chip memories and a decent number of execution units. GPUs are optimized for high throughput and for regular control flow and data access patterns, with an extremely large number of execution resources and decent-sized on-chip memories. In the near future we will likely see tightly coupled CPU/GPU combinations that reap the benefits of both worlds. One way to achieve this is to build processors from multiple dies in a single package, sometimes called chiplets.

Thanks a lot, njuffa, for your detailed answer.

Now I understand why the warp size per SM is fixed. I hope NVIDIA provides detailed documentation on its architectures as a reference for CUDA programmers who, like me, are also microprocessor architecture enthusiasts, since practical GPU programming depends heavily on the underlying architecture.

If you’re not already aware of them, the NVIDIA architecture whitepapers may be an interesting resource. The most recent of these are very long documents, so depending on your background you might also want to look at older ones, which are shorter and cover some basic ideas in more detail. Here is the one for A100, for example.
