Hi, I’d like to check my understanding of the maximum number of warps and the warp size per SM. Sorry if this is a naive question. For the Pascal architecture (cc 6.1), according to Table 15 in the CUDA C++ Programming Guide, each SM supports a maximum of 64 resident warps with 32 threads per warp, so the maximum number of resident threads per SM is 2048.
From an architectural perspective, what hardware feature determines the limits on the maximum number of resident warps and on the warp size per SM? For example, why can’t each SM have 128 resident warps with 16 threads per warp instead, or 32 resident warps with 64 threads?
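(For anyone who wants to check these numbers on their own device, here is a minimal sketch of my own, not from the Programming Guide, that queries the same limits at runtime via cudaGetDeviceProperties; on a cc 6.1 part it should report a warp size of 32 and 2048 threads per SM, i.e. 64 resident warps.)

```cpp
// Minimal sketch: query the per-SM limits discussed above at runtime.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed for illustration

    printf("Compute capability    : %d.%d\n", prop.major, prop.minor);
    printf("Warp size             : %d\n", prop.warpSize);
    printf("Max threads per SM    : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max resident warps/SM : %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);  // 2048 / 32 = 64 on cc 6.1
    return 0;
}
```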
A warp size of 32 threads has been a hardware constant for all NVIDIA GPUs from CC 1.0 to the present CC 9.0. While there is nothing to stop you from coding in such a way that only 16 threads per warp do useful work, you will be wasting 50% of the hardware, as the scheduler issues instructions in units of full warps, i.e. 32 threads.
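To make that concrete, here is a small sketch of my own (not something from the documentation): a block of 16 threads still occupies a full 32-lane warp, so half the lanes of that warp simply sit idle.

```cpp
// Illustration: blocks of 16 threads still consume one full 32-lane warp each,
// so half of the warp's lanes are inactive.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void countActiveLanes() {
    // __activemask() returns a 32-bit mask of the currently active lanes in
    // this warp; __popc() counts the set bits.
    unsigned mask = __activemask();
    if (threadIdx.x == 0) {
        printf("Block %d: %d of 32 lanes active in this warp\n",
               blockIdx.x, __popc(mask));
    }
}

int main() {
    countActiveLanes<<<2, 16>>>();   // 16 threads/block -> 16 of 32 lanes used
    countActiveLanes<<<2, 32>>>();   // 32 threads/block -> full warps
    cudaDeviceSynchronize();
    return 0;
}
```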
As to why the limit is 64 resident warps for CC 6.1, this is probably a hardware trade-off based on the amount of resources the scheduler needs to juggle warps. Once the currently executing warp stalls (waiting on a memory request, waiting to run on a functional unit that is under pressure, etc.), the scheduler parks that warp and runs the next one that is ready to run.
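You can see this 64-warp / 2048-thread ceiling show up through the occupancy API. The sketch below is mine (the kernel and the 256-thread block size are just placeholder choices): for a trivial kernel the runtime reports 8 resident blocks of 8 warps each on a cc 6.1 SM, i.e. the 64-warp cap, and heavier register or shared-memory usage would push that number down.

```cpp
// Sketch: ask the runtime how many blocks of a kernel fit on one SM, and
// translate that into resident warps.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main() {
    int blockSize = 256;          // 256 threads = 8 warps per block (example value)
    int maxBlocksPerSM = 0;

    // Accounts for the kernel's register/shared-memory usage and the hardware limits.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM,
                                                  dummyKernel, blockSize, 0);

    printf("Resident blocks per SM : %d\n", maxBlocksPerSM);
    printf("Resident warps per SM  : %d (hardware cap is 64 on cc 6.1)\n",
           maxBlocksPerSM * blockSize / 32);
    return 0;
}
```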
In processor design, flexibility leads to complexity. The guiding principle of GPU design is minimizing the complexity of handling control flow and, to a lesser extent, data access. This saves square millimeters on the die that can then be used for (1) more execution units and (2) a larger or smarter on-chip memory hierarchy, roughly in that order. For workloads that can benefit from massive parallelism, GPUs owe their performance advantage over CPUs to focusing on these two aspects.
Note that the die sizes of the highest-performing CPUs and GPUs are close to the limit of what is manufacturable (currently around 850 square millimeters; a Xeon Platinum 9200 die is ~700 mm2, an H100 die is ~810 mm2), so design trade-offs have to be made. One cannot “have it all”. Since a larger die translates to higher cost, these trade-offs apply similarly to lower-cost, lower-performing variants at various price points.
This leads to divergent design philosophies. CPUs are optimized for low latency and for irregular control flow and data access patterns, with large on-chip memories and a decent number of execution units. GPUs are optimized for high throughput and for regular control flow and data access patterns, with an extremely large number of execution resources and decent-sized on-chip memories. In the near future we will likely see tightly coupled CPU/GPU combos that reap the benefits of both worlds. One way to achieve this is to build processors from multiple dies in a single package, sometimes called chiplets.
Now I understand why the warp size per SM is fixed. I hope NVIDIA provides detailed documentation of its architectures as a reference for CUDA programmers who, like me, are also microprocessor architecture enthusiasts, since practical GPU programming depends heavily on the underlying architecture.
If you’re not already aware of them, the NVIDIA architecture whitepapers may be an interesting resource. The most recent of these are very long documents, so depending on your background you might also want to look at older ones, which are shorter and cover some basic ideas in more detail. Here is the one for A100, for example.