On Max IPC, IPC, compute utilisation for the P100

ddmusc · September 19, 2020, 1:50pm

The Programming Guide :: CUDA Toolkit Documentation states that the P100, a GPU with compute capability 6.0, has two schedulers per SM, where “each scheduler issues one instruction for one of its assigned warps that is ready to execute”. This should imply that the Max IPC per SM is 2, but nvprof/nvvp reports a Max IPC of 3, as shown in the snapshot below.

Does anybody know why?

I have also noticed that the ipc metric from nvprof tracks the compute utilisation shown in the nvvp graphs very well. Are these really correlated? From all my experiments, I am getting the impression that the P100 schedulers do not handle memory-related instructions and that there is an extra scheduler for those. Can anyone shed some light on the matter?

Greg · September 20, 2020, 11:42pm

The CC 3.0 - CC 6.x warp scheduler in each SM sub-partition (SMSP) can dispatch two warp instructions per clock. The two instructions are from the same warp and must be independent of each other.

The CC 7.0 - CC 8.x warp scheduler in each SM sub-partition (SMSP) can dispatch one warp instruction per clock.

The CUDA Programming Guide is incorrect. The CUDA profilers and the whitepapers on Maxwell, GP100, Pascal, Volta, and Turing have the correct value. I have filed a bug with the CUDA Documentation team.

For example, p. 12 of the NVIDIA Tesla P100 Whitepaper states: “Each warp scheduler (one per processing block) is capable of dispatching two warp instructions per clock.”

ddmusc · September 21, 2020, 3:02pm

I don’t get this. Why is the Max IPC 3 (as reported by the profiler) when the warp scheduler can dispatch two instructions per clock? The Max IPC should be an even number.

Regards
Daniel

Greg · September 21, 2020, 6:35pm

GP100 has 2 warp schedulers (SMSP) per SM, so max SM IPC is 3.
GMxxx/GP10x have 4 warp schedulers (SMSP) per SM, so max SM IPC is 6.
The value shown in the UI is the max issue rate per SM (Multiprocessor).

The reason the value is not (SMSPs per SM) x 2 instructions/cycle is that the SM has other limits that allow a sustained IPC of only 1.5 instructions per cycle per SMSP. The profiling tools show the sustained IPC rather than the per-cycle burst rate.
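
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function and dictionary names are made up for this example, and the rates and SMSP counts are simply the figures quoted in this thread, not values queried from a device or a profiler.

```python
# Illustrative sketch only: names are hypothetical, numbers are the ones
# quoted in this thread for CC 3.0 - 6.x parts.
BURST_IPC_PER_SMSP = 2        # dual-dispatch rate per SMSP in a single cycle
SUSTAINED_IPC_PER_SMSP = 1.5  # sustained rate per SMSP after other SM limits

# Warp schedulers (SMSPs) per SM, as stated above.
SMSP_PER_SM = {"GP100": 2, "GMxxx/GP10x": 4}


def max_ipc_per_sm(chip, per_smsp_rate=SUSTAINED_IPC_PER_SMSP):
    """Return the max IPC per SM for a given per-SMSP issue rate."""
    return SMSP_PER_SM[chip] * per_smsp_rate


for chip in SMSP_PER_SM:
    sustained = max_ipc_per_sm(chip)
    burst = max_ipc_per_sm(chip, BURST_IPC_PER_SMSP)
    print(f"{chip}: sustained max IPC = {sustained}, burst = {burst}")
```

With the sustained rate, GP100 comes out at 3 and GMxxx/GP10x at 6, matching the Max IPC values above; the burst rate would give 4 and 8, which is not what the profilers report.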

ddmusc · October 11, 2020, 7:45pm

Greg, Thanks a lot for the very good and helpful answer.
