On Max IPC, IPC, compute utilisation for the P100

The Programming Guide :: CUDA Toolkit Documentation states that the P100, a GPU with compute capability 6.0, has two schedulers per SM, where “each scheduler issues one instruction for one of its assigned warps that is ready to execute”. This should imply that the Max IPC per SM is 2, but nvprof/nvvp shows a Max IPC of 3, as in the snapshot below:

image

Does anybody know why?

I have also noticed that the ipc metric from nvprof closely tracks the compute utilisation shown in the nvvp graphs. Are these really correlated? From all my experiments I get the impression that the P100 schedulers do not handle memory-related instructions and that there is an extra scheduler for those. Can anyone shed some light on the matter?

The CC 3.0 - CC 6.x warp scheduler in each SM sub-partition (SMSP) can dispatch two warp instructions per clock. The two instructions are from the same warp and must be independent of each other.

The CC 7.0 - CC 8.x warp scheduler in each SM sub-partition (SMSP) can dispatch one warp instruction per clock.

The CUDA Programming Guide is incorrect. The CUDA profilers and the whitepapers on Maxwell, GP100, Pascal, Volta, and Turing have the correct value. I have filed a bug with the CUDA Documentation team.

For example, the NVIDIA Tesla P100 Whitepaper (p. 12) states: “Each warp scheduler (one per processing block) is capable of dispatching two warp instructions per clock.”

I don’t get this. Why is Max IPC 3 (as reported by the profiler) when the warp scheduler can dispatch two instructions per clock? Max IPC should be an even number.

Regards
Daniel

GP100 has 2 warp schedulers (SMSPs) per SM, so the max SM IPC is 3.
GMxxx/GP10x have 4 warp schedulers (SMSPs) per SM, so the max SM IPC is 6.
The value shown in the UI is the max issue rate per SM (Multiprocessor).

The reason the value is not (SMSPs per SM) x 2 instructions/cycle is that the SM has other limits that allow a sustained IPC of only 1.5 instructions per cycle per SMSP. The profiling tools show the sustained IPC rather than the per-cycle burst rate.
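
To make the arithmetic concrete, here is a rough sketch in Python. It only multiplies the numbers quoted in this thread (SMSPs per SM, 1.5 sustained and 2 burst instructions per cycle per SMSP); the names `SMSP_PER_SM` and `max_ipc_per_sm` are made up for illustration and are not part of any CUDA tool or API.

```python
# Per-SMSP issue rates for CC 3.0 - CC 6.x, as quoted in this thread.
SUSTAINED_IPC_PER_SMSP = 1.5  # sustained rate, limited by other SM resources
BURST_IPC_PER_SMSP = 2        # per-cycle burst (dual dispatch from one warp)

# SMSP (warp scheduler) counts per SM, as quoted in this thread.
SMSP_PER_SM = {"GP100": 2, "GMxxx/GP10x": 4}


def max_ipc_per_sm(smsp_count, ipc_per_smsp):
    """Return the per-SM issue rate for a given per-SMSP rate."""
    return smsp_count * ipc_per_smsp


for chip, smsp_count in SMSP_PER_SM.items():
    sustained = max_ipc_per_sm(smsp_count, SUSTAINED_IPC_PER_SMSP)
    burst = max_ipc_per_sm(smsp_count, BURST_IPC_PER_SMSP)
    print(f"{chip}: sustained {sustained:g} IPC/SM, burst {burst} IPC/SM")
```

The sustained figure is the one the profiler reports as Max IPC, which is why it comes out as 3 for GP100 and need not be an even number.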

Greg, thanks a lot for the very good and helpful answer.