In the ncu analysis, the warp scheduler statistics state that GPU Maximum Warps Per Scheduler is 16. Does this mean each scheduler can issue 16 warps in one cycle? How is this number computed?
The schedulers (1 per SMSP → 4 per SM) switch between the assigned warps. Up to 16 warps (from one or several blocks, from one or several kernels) can be resident on each SMSP.
With 4 * 16 = 64 warps per SM you get a maximum of 32 * 64 = 2048 threads per multiprocessor for this GPU, which is perhaps more familiar. Both numbers are closely related, as warps are the relevant granularity for SM hardware limits.
Each scheduler (of current architectures) can only issue an instruction from 1 warp per cycle.
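As a minimal sketch of the arithmetic above (plain Python, not a CUDA API; the function name `resident_limits` and its parameters are made up for illustration, and the inputs are the values quoted in this thread for this GPU):

```python
WARP_SIZE = 32  # threads per warp

def resident_limits(schedulers_per_sm, max_warps_per_scheduler, warp_size=WARP_SIZE):
    """Return (max resident warps per SM, max resident threads per SM)."""
    warps_per_sm = schedulers_per_sm * max_warps_per_scheduler
    threads_per_sm = warps_per_sm * warp_size
    return warps_per_sm, threads_per_sm

# Values from the ncu report discussed here: 4 schedulers per SM, 16 warps per scheduler.
warps, threads = resident_limits(schedulers_per_sm=4, max_warps_per_scheduler=16)
print(warps, threads)  # 64 2048
```

These are limits on how many warps can be resident, not on how many are issued per cycle.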
Thank you very much! The post "How many thread are executed at the same time?" says that each scheduler can dual-issue each cycle, and it calculates that each SM can simultaneously fire 256 threads per cycle. Is this calculation correct? How can I get the relevant parameter information? The K1 is also quite an old GPU architecture.
Tegra K1 is compute capability (= architecture) 3.2. At that time the SMs were not partitioned yet (or had one partition ;-)).
The modern partitioning involves more than assigning warps to specific schedulers. In modern architectures the registers and arithmetic (INT and FP32) units are partitioned, too.
Nevertheless there were 4 schedulers per SM on the Kepler architecture, and each could schedule up to 2 independent instructions (from the same warp) per cycle.
4 schedulers * 32 threads/warp = 128 threads/cycle.
4 schedulers * 32 threads/warp * 2 instructions/thread = 256 instructions/cycle.
Those numbers are a theoretical maximum.
I would expect realistic numbers (especially the 256 instructions) to be much less, even with optimized code.
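The theoretical peak above can be written out as a small calculation. This is only a sketch in plain Python; `peak_issue_rates` and `issue_width` are illustrative names, not anything reported by ncu or the CUDA runtime, and the result is an upper bound rather than a measured rate:

```python
def peak_issue_rates(schedulers, warp_size=32, issue_width=1):
    """Return (threads issued per cycle, thread-instructions issued per cycle).

    issue_width is the number of instructions a scheduler can issue
    from its selected warp in one cycle (2 for Kepler dual-issue).
    """
    threads_per_cycle = schedulers * warp_size
    instructions_per_cycle = threads_per_cycle * issue_width
    return threads_per_cycle, instructions_per_cycle

print(peak_issue_rates(4, issue_width=2))  # (128, 256)  Kepler-style dual-issue
print(peak_issue_rates(4, issue_width=1))  # (128, 128)  single-issue schedulers
```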
Perhaps more interesting (for calculating performance) are the 192 CUDA cores for arithmetic instructions. For some reason the (old) programming guides state that a maximum of 160 arithmetic instructions could be scheduled each cycle. Does anybody know where the 160 comes from? Or is the 192 wrong? The SM (or SMX, as it was called) drawings show 192 cores.