About the number of CUDA cores in an SMSP: less or greater than the warp size (32)

As the CUDA C Programming Guide says, a warp consists of 32 threads. If the number of CUDA cores in one SMSP is less or greater than 32, how does the warp scheduler schedule?
(1) If the number of CUDA cores in one SMSP is 16, will a warp be divided into two halves of 16 threads each for execution?
(2) If the number of CUDA cores in one SMSP is 64, will the warp scheduler dispatch two warps to execute?


Do you have an example of that? The only way it would make sense is if there were:

  • multiple warp schedulers per SMSP

  • the warp scheduler can issue more than 32 threads/clk

AFAIK there is no such GPU that has an SM subdivision into two or more SMSPs with each SMSP having 64 CUDA FP32 cores, and also has either a warp scheduler with more than 32 threads/clk issue rate, or multiple warp schedulers per SMSP. So I have no answer. There is no such animal.

and also has either a warp scheduler with more than 32 threads/clk issue rate

Is that a hardware restriction?

The CUDA C Programming Guide says a warp consists of 32 threads. I think the word ‘warp’ is a definition from the software view, but at the hardware layer there is a warp scheduler with a maximum number of threads it can schedule per clock. Am I right?

Although it deals primarily with latency, you may find that Greg’s reply in this thread fills in some details.

You mentioned warp scheduler:

According to my observation, the capability of the SMSP warp scheduler in terms of threads/clk is documented in various whitepapers. I don’t know if it is a HW or SW restriction. For example, refer to the V100 whitepaper, p32 fig 5. Or another example is the GA102 whitepaper, p12 fig 3.

That is my understanding based on the two examples I have already provided. I likely won’t be able to respond to further questions about your case 2:

because as far as I know, that is an imaginary case. There is no current CUDA GPU that fits that description, therefore I have no information about it, therefore I have no further comments.

I see, many thanks to you.

I think double issue has something to do with this. In https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/documents/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf, one warp scheduler can pick one warp (32 threads), and then two dispatch units issue two consecutive instructions from that warp.

Theoretically, if you have 64 cores (IPC = 1), you need the ability to issue 2 (warp-level) instructions per cycle to saturate the cores. We know that a warp scheduler with one dispatch unit can issue 1 (warp-level) instruction per cycle, so 64 cores call for either two warp schedulers, or one scheduler with two dispatch units. In the former case we have 2 × 32 cores and 2 warp schedulers, which we can split into 2 SMSPs. The latter case is exactly double issue.

I don’t know why NVIDIA abandoned double issue in newer architectures (such as A100). I think it’s good for improving the IPC of instruction issue. For example, if we had double issue in H100, we could issue 2 instructions per cycle: one to the FP32 units (32 cores per SMSP) and another to the LSU (e.g., a shared memory load). Can anyone help with this question?

Probably not worth the added complexity. Typical execution units have no more than 16 threads/clk of throughput anyway, so a single warp instruction occupies a unit for two cycles, and a scheduler issuing one instruction per cycle can already keep at least two units fully utilized at a time.

Shared memory is slower; if all 4 SMSPs access it, each SMSP sees only 1/4 of its maximum throughput. The performance loss of the arithmetic units is small if there is an intermittent LSU instruction here and there.

The CUDA architecture is optimized for a constant, very high level of throughput, but not for having all possible units fully active at the same time.

Many kernels have huge bottlenecks elsewhere, e.g. they are memory-bound or limited by cache size.

The few kernels which achieve nearly the theoretical optimum number of instruction dispatches are typically dominated by a high density of basic arithmetic operations (especially floating-point fused multiply-add). The architecture is optimized to fully use the available units for that case. Every other instruction mix needs a compromise anyway, and a lot of those kernels do not need 100% of the arithmetic compute performance.

If you had one of those rare kernels which would profit from full arithmetic throughput plus some other units, you probably would not be happy if the number of SMs were reduced because die area is limited and the complexity of additional dispatch units took its toll, or if the base and boost frequencies had to be lowered.

You may assume that the SM architecture engineers at Nvidia have not simply overlooked that more instructions could be dispatched; rather, they compare many different scenarios and implementations, and simulate the runtime speed of a large number of real-world kernels with different instruction mixes for possible architecture optimizations, to find the ideal compromise.

Every two years they have to show a speed-up over the previous generation, ideally not only from smaller semiconductor process nodes, but also from optimizing the SM architecture. For seven years all Nvidia GPUs have used Volta-like architectures (with no end in sight for the next few years). If there were a low-hanging fruit offering just a 10% speed-up for the overall GPU, they would gladly take it. And they do find tweaks, sometimes general optimizations, sometimes domain-specific ones, e.g. the uniform datapath or sparse tensor matrices.


Thanks! I am not a pro in hardware design, but I know it is a complex project that involves many trade-offs. Most likely, Nvidia engineers did research on the dispatch unit design and decided to abandon multiple issue for multiple reasons, both software and hardware.