How to compute the A100 GPU’s maximum warps per scheduler

1354998814 · July 17, 2024, 4:25am

In the ncu analysis, the warp scheduler statistics state that GPU Maximum Warps Per Scheduler is 16. Does this mean a scheduler can issue 16 warps in one cycle? How is this number computed?

Curefab · July 17, 2024, 8:51am

The schedulers (1 per SMSP → 4 per SM) switch between the assigned warps. Up to 16 warps (from one or several blocks, from one or several kernels) can be resident on each SMSP.

With 4 * 16 = 64 warps per SM you get a maximum of 32 * 64 = 2048 threads per multiprocessor for this GPU, which is perhaps more familiar. Both numbers are closely related, as warps are the relevant granularity for SM hardware limits.
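The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the constant and function names are made up for this example, and the values (4 schedulers per SM, 16 warps per scheduler, 32 threads per warp) are the ones quoted in this thread for the A100.

```python
# Values quoted in this thread for the A100; the names are hypothetical.
WARP_SIZE = 32                 # threads per warp
SCHEDULERS_PER_SM = 4          # one scheduler per SMSP
MAX_WARPS_PER_SCHEDULER = 16   # reported by ncu


def max_resident_warps_per_sm(schedulers_per_sm, warps_per_scheduler):
    """Warps that can be resident on one SM at the same time."""
    return schedulers_per_sm * warps_per_scheduler


def max_resident_threads_per_sm(schedulers_per_sm, warps_per_scheduler,
                                warp_size=WARP_SIZE):
    """Threads that can be resident on one SM at the same time."""
    warps = max_resident_warps_per_sm(schedulers_per_sm, warps_per_scheduler)
    return warps * warp_size


def warps_per_scheduler_from_threads(max_threads_per_sm, schedulers_per_sm,
                                     warp_size=WARP_SIZE):
    """Reverse direction: derive the per-scheduler limit from the
    more familiar threads-per-multiprocessor limit."""
    return max_threads_per_sm // (warp_size * schedulers_per_sm)


print(max_resident_warps_per_sm(SCHEDULERS_PER_SM, MAX_WARPS_PER_SCHEDULER))
print(max_resident_threads_per_sm(SCHEDULERS_PER_SM, MAX_WARPS_PER_SCHEDULER))
print(warps_per_scheduler_from_threads(2048, SCHEDULERS_PER_SM))
```

Note that these are residency limits (how many warps a scheduler can choose from), not issue rates.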

Each scheduler (on current architectures) can issue an instruction from only 1 warp per cycle.

1354998814 · July 17, 2024, 9:06am

Thank you very much! The post "How many thread are executed at the same time ?" says that each scheduler can dual-issue each cycle, and it calculates that each SM can fire 256 threads per cycle. Is this calculation correct? How can I get the relevant parameter information? The K1 is also quite an old GPU architecture.

Curefab · July 17, 2024, 9:17am

Tegra K1 is compute capability (= architecture) 3.2. At that time the SMs were not partitioned yet (or had one partition ;-)).

The modern partitioning involves more than assigning warps to specific schedulers. In modern architectures the registers and arithmetic (INT and FP32) units are partitioned, too.

Nevertheless, there were 4 schedulers on the Kepler architecture, and each could schedule up to 2 instructions (from the same warp) per cycle.

4 schedulers * 32 threads/warp = 128 threads/cycle.
4 schedulers * 32 threads/warp * 2 instructions/thread = 256 instructions/cycle.

Those numbers are a theoretical maximum.
I would expect realistic numbers (especially the 256 instructions) to be much less, even with optimized code.
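As a rough sketch of the peak-rate arithmetic (the function name and parameters are hypothetical, chosen only for this illustration), the same formula covers both the dual-issue Kepler case and a single-issue scheduler:

```python
def peak_issue_per_cycle(schedulers, warp_size=32, instructions_per_issue=1):
    """Theoretical per-SM peak, ignoring stalls and unit availability.

    Returns (threads served per cycle, thread-level instructions per cycle).
    """
    threads = schedulers * warp_size
    instructions = threads * instructions_per_issue
    return threads, instructions


# Kepler as described in this thread: 4 schedulers, dual issue.
print(peak_issue_per_cycle(schedulers=4, instructions_per_issue=2))

# Single issue (1 warp per scheduler per cycle), 4 schedulers per SM.
print(peak_issue_per_cycle(schedulers=4, instructions_per_issue=1))
```

These are upper bounds only; real kernels will usually issue well below them.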

Perhaps more interesting (for calculating performance) are the 192 CUDA cores for arithmetic instructions. For some reason the (old) programming guides state that a maximum of 160 arithmetic instructions could be scheduled each cycle. Does anybody know where the 160 comes from? Or is the 192 wrong? The SM (or SMX, as it was called) drawings show 192 cores.
