# Calculation of the maximum performance of the Tesla K40 GPU (general discussion)

Hi,
I have noticed that the peak performance of the K40 GPU is quoted as 4290 Gflops, calculated as follows:

GPUpeak = no. of cores x threads per core x ops per thread per cycle x frequency
= 15 x 192 x 2 ops/cycle x 0.745 GHz
≈ 4290 Gflops

However, I read that an SM of the GPU can execute only N warps at a time (is N = 4?). This N is not the number of active warps but the number of warps actually executing on the floating-point units. Therefore, the following calculation,

= 0.745 GHz x 32 x 2 x 4 x 15
= 2860.8 Gflops

differs substantially from the peak (it is only about 67% of it). Please explain the following.

1. Are my above calculations correct?
2. If they are incorrect, how is the peak performance calculated in terms of warps?
3. How many warps can actually execute at the same time (as opposed to merely being active)?

My understanding is that most active warps are just waiting to receive data from memory and are not
performing floating-point operations; they are there to hide the memory latency. But to calculate the actual performance
we need to count only the warps that are actually executing.

Thank you.

Your computation of single-precision peak performance arrives at the correct result, but the factors are actually:

15 SMs x 192 (single-precision execution units/SM) x 1 (FMA/execution unit/cycle) x 2 (floating-point operations/FMA) x 745 MHz = 4.29 TFLOPS.
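As a quick check of the arithmetic above, here is a minimal Python sketch. The function and parameter names are my own, chosen only for illustration; the numbers are the ones quoted in this thread.

```python
# Peak single-precision throughput from the factors quoted in this thread.
# Function and parameter names are illustrative, not from any NVIDIA tool.
def peak_gflops(num_sms, units_per_sm, fma_per_unit_per_cycle,
                flops_per_fma, clock_ghz):
    """Theoretical peak in GFLOPS: every unit retires one FMA every cycle."""
    return (num_sms * units_per_sm * fma_per_unit_per_cycle
            * flops_per_fma * clock_ghz)

# Tesla K40: 15 SMs, 192 SP units per SM, 1 FMA per unit per cycle,
# 2 floating-point operations per FMA, 0.745 GHz clock.
k40_peak = peak_gflops(15, 192, 1, 2, 0.745)
print(f"{k40_peak:.1f} GFLOPS")
```

The product comes out slightly above 4290 GFLOPS, which rounds to the 4.29 TFLOPS figure.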

The limiter is the number of execution units per SM. You can certainly run more than 192 threads on each SM; in fact, that is what is desired, to create enough parallelism that basic latencies are covered. Typically one would want to run 1024 or more threads per SM.

Hi, Thank you for the reply.

That means an SM can execute 192/32 = 6 warps per clock, and that corresponds to the peak performance (4.29 TFLOPS).
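The same relationship can be written out in terms of warps. This is only a sketch under an idealized assumption: every warp instruction is an FMA that occupies 32 single-precision units for one cycle. The variable and function names are my own.

```python
# Throughput expressed in warps per clock, using the K40 figures from this thread.
# Idealized assumption: each warp instruction is an FMA that keeps 32 units busy
# for exactly one cycle. Names are illustrative only.
WARP_SIZE = 32
UNITS_PER_SM = 192
NUM_SMS = 15
CLOCK_GHZ = 0.745
FLOPS_PER_FMA = 2

def gflops_for_warps_per_clock(warps_per_clock):
    """GFLOPS if each SM executes this many FMA warp instructions per cycle."""
    return warps_per_clock * WARP_SIZE * FLOPS_PER_FMA * NUM_SMS * CLOCK_GHZ

# Warps per clock needed to keep all 192 units of an SM busy.
warps_to_saturate = UNITS_PER_SM // WARP_SIZE

peak = gflops_for_warps_per_clock(warps_to_saturate)
four_warp_estimate = gflops_for_warps_per_clock(4)
ratio = four_warp_estimate / peak

print(f"warps per clock to saturate one SM: {warps_to_saturate}")
print(f"peak: {peak:.1f} GFLOPS")
print(f"4-warp estimate: {four_warp_estimate:.1f} GFLOPS ({ratio:.1%} of peak)")
```

Under this assumption the 4-warp estimate from the original question is exactly 4/6 of the peak, which is where the roughly 67% figure comes from.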