The CUDA C Programming Guide (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-6-x) documents that the P100, a GPU with compute capability 6.0, has two warp schedulers per SM, where “each scheduler issues one instruction for one of its assigned warps that is ready to execute”. This should imply that the maximum IPC per SM is 2, but nvprof/nvvp report a maximum IPC of 3, as shown in the snapshot below.
Does anybody know why?
I have also noticed that the ipc metric from nvprof tracks the compute utilisation shown in the nvvp graphs very closely. Are the two really correlated? From all my experiments I am getting the impression that the two P100 schedulers do not handle memory-related instructions and that there is an extra scheduler for those. Can anyone shed some light on this?
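To make the arithmetic behind my expectation explicit, here is a minimal Python sketch. The function name `max_ipc_per_sm` and the parameter `issues_per_scheduler_per_cycle` are my own, purely for illustration. The second call assumes a hypothetical case in which each scheduler could issue two instructions per cycle; the sentence quoted from the guide does not say that.

```python
def max_ipc_per_sm(schedulers_per_sm, issues_per_scheduler_per_cycle=1):
    """Upper bound on instructions issued per cycle by one SM.

    Assumes every scheduler issues on every cycle (no stalls).
    """
    return schedulers_per_sm * issues_per_scheduler_per_cycle


# From the programming guide: two schedulers per SM on compute capability 6.0,
# each issuing one instruction per cycle.
documented_bound = max_ipc_per_sm(schedulers_per_sm=2)

# Hypothetical assumption (not from the quoted sentence): two issues per
# scheduler per cycle.
hypothetical_bound = max_ipc_per_sm(schedulers_per_sm=2,
                                    issues_per_scheduler_per_cycle=2)

reported_max_ipc = 3  # the value nvvp shows for the P100

print("documented bound:", documented_bound)
print("hypothetical bound:", hypothetical_bound)
print("reported max exceeds documented bound:",
      reported_max_ipc > documented_bound)
```

The reported maximum of 3 is above the bound I derive from the documentation, which is why I am asking.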