How to calculate theoretical FP32 instructions per cycle (IPC) on an NVIDIA GPU

I’m having a hard time understanding how the theoretical Instructions Per Cycle (IPC) for a Fermi-architecture NVIDIA GPU can be 2.0,

according to http://on-demand.gputechconf.com/gtc-express/2011/presentations/Inst_limited_kernels_Oct2011.pdf, page 9.

In section 5.4.1 of the programming guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#arithmetic-instructions), the throughput for 32-bit floats is given as 32 fp32 operations/SM/clock cycle.

How do the two quantities relate?

The IPC figure refers to a single multiprocessor (SM).
An instruction is issued per warp, i.e. for 32 threads.
So one instruction amounts to 32 operations.

The “Instructions-Per-Cycle” definition is a bit misleading: the “cycle” here is a cycle of the warp schedulers’ clock, which corresponds to two clock cycles of the CUDA cores on compute capability 2.0.
A c.c. 2.0 SM has 2 single-issue warp schedulers, so its peak IPC is 2.

Each of these 2 warp instructions is executed on 16 CUDA cores over 2 core clock cycles.
So each SM can sustain a throughput of (16 + 16) = 32 operations per core clock cycle
(that’s the clock considered in section 5.4.1).
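The arithmetic above can be written out as a minimal sketch; the constants are the compute capability 2.0 (Fermi) values quoted in this thread, not measured data:

```python
WARP_SIZE = 32            # threads per warp; one instruction covers 32 threads
SCHEDULERS_PER_SM = 2     # single-issue warp schedulers per SM (c.c. 2.0)
CORES_PER_SCHEDULER = 16  # CUDA cores each scheduler's warp executes on

# IPC counted per scheduler-clock cycle: each scheduler issues 1 instruction.
ipc = SCHEDULERS_PER_SM * 1  # -> 2

# Each warp instruction (32 operations) runs on 16 cores over 2 core clocks,
# so per core clock each scheduler contributes 16 operations.
ops_per_sm_per_core_clock = SCHEDULERS_PER_SM * CORES_PER_SCHEDULER  # -> 32

print("peak IPC per SM:", ipc)
print("fp32 ops per SM per core clock:", ops_per_sm_per_core_clock)
```

So the 2.0 IPC from the slides and the 32 ops/SM/clock from the programming guide are the same peak throughput, counted against different clocks and in different units.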

Thanks Davide!

I am confused by the definition of IPC: by “instructions”, do we mean instructions from the PTX instruction set or from the SASS (machine code) instruction set? The two have different instruction counts after compilation of the same CUDA C code.

For true performance analysis, the only thing that matters is SASS.
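In practice you can inspect the SASS yourself with the binary utilities shipped in the CUDA toolkit (a sketch assuming your source file is named `kernel.cu`):

```shell
# Compile for Fermi (sm_20) and dump the machine code actually executed.
nvcc -arch=sm_20 -cubin kernel.cu -o kernel.cubin
cuobjdump -sass kernel.cubin

# For comparison, the intermediate PTX:
nvcc -arch=sm_20 -ptx kernel.cu -o kernel.ptx
```

Counting instructions in the `cuobjdump -sass` output is what an IPC figure should be compared against.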

Thanks txbob.

fp64 instructions: an IPC of 1.0 is possible.

Now, is this because, instead of 16 threads from each warp being issued by each of the two schedulers, the complete set of 32 SPs is used for the calculation, hence the IPC is 1.0?

But I don’t get how an fp64 instruction would use all 32 SPs. Can you please explain the reason in simple words for better understanding?