Calculation of the maximum performance of the Tesla K40 GPU (general discussion)

samarawickrama · January 14, 2015, 10:52pm

Hi,
I have noticed the peak performance of the K40 GPU is 4290 Gflops and it is calculated as follows:

GPUpeak = no. of cores x threads per core x op per thread x frequency
= 15 x 192 x 2op/cycle x 0.745 GHz
= 4290 Gflops

However, I read that an SM of the GPU can execute only N warps at a time (is N = 4?). This N is not the number of active warps but the number of warps actually executing on the floating-point units. Therefore the following calculation,

GPUpeak = frequency x threadsPerWarp x opPerThread x warpsPerClock x SMs
= 0.745 GHz x 32 x 2 x 4 x 15
= 2860.8 Gflops

gives a very different result (about 67% of the peak). Please explain the following.

Are my calculations above correct?
If not, how is the peak performance calculated in terms of warps?
How many warps can actually execute at a time (as opposed to the number of active warps)?

My understanding is that the remaining active warps are just waiting for data from memory and are not
performing floating-point operations; they serve to hide the memory latency. So to calculate the actual
performance we need to count only the warps that are actually executing.

Thank you.

njuffa · January 14, 2015, 11:23pm

Your computation of single-precision peak performance arrives at the correct result, but it is actually:

15 SMs x 192 (single-precision execution units/SM) x 1 (FMA/execution unit/cycle) x 2 (floating-point operations/FMA) x 745 MHz = 4.29 TFLOPS.

The limiter is the number of execution units per SM. You can certainly run more than 192 threads on each SM; in fact, that is what is desired, to create enough parallelism that basic latencies are covered. Typically one would want to run 1024 or more threads per SM.
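As a quick sanity check, the peak calculation above can be sketched in a few lines of Python. The function and parameter names are purely illustrative (they are not part of any NVIDIA tool or API), and the figures are the ones quoted in this thread.

```python
# Sketch of the theoretical single-precision peak for a Tesla K40.
# All names here are hypothetical; the numbers come from the thread.

def peak_gflops(sms, fp32_units_per_sm, fma_per_unit_per_cycle,
                flops_per_fma, clock_ghz):
    """Theoretical peak in GFLOPS: execution units x FMA rate x flops per FMA x clock."""
    return (sms * fp32_units_per_sm * fma_per_unit_per_cycle
            * flops_per_fma * clock_ghz)

k40_peak = peak_gflops(
    sms=15,                    # streaming multiprocessors
    fp32_units_per_sm=192,     # single-precision execution units per SM
    fma_per_unit_per_cycle=1,  # one fused multiply-add per unit per cycle
    flops_per_fma=2,           # an FMA counts as a multiply plus an add
    clock_ghz=0.745,           # 745 MHz
)

print(f"{k40_peak:.1f} GFLOPS")  # roughly 4.29 TFLOPS
```

Note that the thread count per SM does not appear anywhere in the formula: running more threads helps cover latency, but the ceiling is set by the execution units.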

samarawickrama · January 15, 2015, 1:42am

Hi, Thank you for the reply.

That means an SM can execute 192/32 = 6 warps per clock, which corresponds to the peak performance (4.29 TFLOPS).

GPUpeak = frequency x threadsPerWarp x opPerThread x warpsPerClock x SMs
= 0.745 GHz x 32 x 2 x 6 x 15
= 4.29 Tflops (peak performance)

Even though many threads can be active, only 192 threads can execute per clock, given the available resources.
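The warp-based version of the formula can be checked the same way. This is only a sketch with made-up helper names; it compares the N = 4 assumption from the first post with the N = 6 derived from 192 execution units per SM.

```python
# Sketch: peak performance expressed in warps per clock.
# Helper names are hypothetical; figures are the ones used in this thread.

THREADS_PER_WARP = 32

def warps_per_clock(fp32_units_per_sm):
    """Number of warp-wide FP32 instructions the units can absorb per clock."""
    return fp32_units_per_sm // THREADS_PER_WARP

def peak_gflops_from_warps(clock_ghz, flops_per_thread, warps, sms):
    """frequency x threadsPerWarp x opPerThread x warpsPerClock x SMs."""
    return clock_ghz * THREADS_PER_WARP * flops_per_thread * warps * sms

n = warps_per_clock(192)                              # 192 / 32
peak_n6 = peak_gflops_from_warps(0.745, 2, n, 15)     # uses the derived N
peak_n4 = peak_gflops_from_warps(0.745, 2, 4, 15)     # uses N = 4
ratio = peak_n4 / peak_n6                             # 4/6, about two thirds

print(n, round(peak_n6, 1), round(peak_n4, 1), round(ratio, 3))
```

With N = 6 the warp-based formula reproduces the unit-based peak exactly, since 32 x 6 = 192.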

Gert-Jan · January 15, 2015, 8:33am

According to the Kepler whitepaper, an SM can schedule 4 warps per clock, but it can also dispatch two instructions per warp. Quote from the whitepaper (page 9): “Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle.”
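To relate the quoted scheduler figures to the 6 warps per clock discussed earlier, here is a rough sketch. It assumes every dispatched instruction is a warp-wide FP32 instruction and ignores any other pairing or operand restrictions, so treat it as back-of-the-envelope arithmetic rather than a model of the hardware. The variable names are illustrative.

```python
# Sketch: dispatch capacity vs. FP32 execution capacity per SM, per cycle.
# Assumption (not from the whitepaper quote): each dispatched instruction
# occupies 32 FP32 lanes, and dual-issue is always possible.

warp_schedulers_per_sm = 4      # "quad warp scheduler"
dispatch_per_scheduler = 2      # two independent instructions per warp
threads_per_warp = 32
fp32_units_per_sm = 192

# Upper bound on warp instructions dispatched per cycle.
dispatch_slots = warp_schedulers_per_sm * dispatch_per_scheduler

# Warp-wide FP32 instructions the execution units can absorb per cycle.
fp32_warp_instructions = fp32_units_per_sm // threads_per_warp

# What you would get if each selected warp issued only one instruction.
single_issue_only = warp_schedulers_per_sm

print(dispatch_slots, fp32_warp_instructions, single_issue_only)
print(single_issue_only < fp32_warp_instructions <= dispatch_slots)
```

Under these assumptions, four warps issuing one instruction each cannot fill all 192 units; at least some warps need a second independent instruction dispatched in the same cycle, which is why the N = 4 estimate falls short of the peak.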