I know it is a useless point, but I was trying to see how this peak performance is computed. I succeeded in deciphering part of it, but there is a small section I can’t explain.

The idea is to take the highest-throughput instruction an SM can execute per cycle, and work out how many of those all the SMs together can complete in one second.

The highest throughput naturally belongs to the single-precision floating-point ops, as shown in the programming guide, section 5.4.1… For compute capability 2.0 this throughput is 32 ops per cycle per SM. There are small tricks here, but this is how it goes for the Tesla C2050 (compute capability 2.0):

The trick here is that the cores are FMA (fused multiply-add) units, and each FMA is counted as 2 flops, doubling the nominal throughput of the operation.

This also applies to double precision, which runs at half the single-precision throughput, putting it at 515.2 GFlop/s.
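The FMA-only figures can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the C2050 has 14 SMs and a 1.15 GHz shader clock; neither number is stated in this thread, so treat them as assumptions taken from the card's published specs. The helper name `peak_gflops` is just for illustration.

```python
# Peak FMA throughput for a Tesla C2050 (compute capability 2.0).
# ASSUMPTIONS (not stated in the thread): 14 SMs, 1.15 GHz shader clock.
NUM_SMS = 14
SHADER_CLOCK_GHZ = 1.15

SP_OPS_PER_CYCLE_PER_SM = 32  # single-precision ops per cycle per SM (from the post)
DP_OPS_PER_CYCLE_PER_SM = SP_OPS_PER_CYCLE_PER_SM // 2  # half the SP rate
FLOPS_PER_FMA = 2  # one fused multiply-add is counted as 2 flops


def peak_gflops(ops_per_cycle_per_sm, flops_per_op,
                num_sms=NUM_SMS, clock_ghz=SHADER_CLOCK_GHZ):
    """Peak rate in GFlop/s: SMs * ops/cycle/SM * flops/op * cycles/ns."""
    return num_sms * ops_per_cycle_per_sm * flops_per_op * clock_ghz


sp_peak = peak_gflops(SP_OPS_PER_CYCLE_PER_SM, FLOPS_PER_FMA)
dp_peak = peak_gflops(DP_OPS_PER_CYCLE_PER_SM, FLOPS_PER_FMA)

print(f"single precision: {sp_peak:.1f} GFlop/s")  # 1030.4
print(f"double precision: {dp_peak:.1f} GFlop/s")  # 515.2
```

With those assumed values the result matches both the 515.2 GFlop/s double-precision figure and the roughly 1030 GFlop/s single-precision figure quoted in this thread.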

The interesting part is the special function units (SFUs). These are independent of the cores that do the single- and double-precision ops, and can run in parallel with them. However, they have a low throughput, namely 4 ops per cycle per SM. Even with the trick of counting a reciprocal square root as 2 flops, the SFUs add only 128.8 GFlop/s, for a total of 1159.2 GFlop/s, exactly 128.8 GFlop/s short of the reported 1288 GFlop/s (FMA+SFU).

What am I missing here? Is there a special operation that counts as 4 flops?
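To make the gap concrete, here is a small sketch that adds the SFU contribution to the FMA peak for a few different flops-per-SFU-op weightings, and then solves for the weighting that the 1288 figure would imply. It again assumes 14 SMs and a 1.15 GHz shader clock (not stated in this thread). It only shows which weighting makes the arithmetic close; it does not establish that any source actually counts SFU ops that way.

```python
# ASSUMPTIONS (not stated in the thread): 14 SMs, 1.15 GHz shader clock.
NUM_SMS = 14
SHADER_CLOCK_GHZ = 1.15

FMA_PEAK_GFLOPS = NUM_SMS * 32 * 2 * SHADER_CLOCK_GHZ  # 32 FMAs/cycle/SM, 2 flops each
SFU_OPS_PER_CYCLE_PER_SM = 4                           # from the post
REPORTED_TOTAL_GFLOPS = 1288.0                         # the FMA+SFU figure in question


def sfu_gflops(flops_per_sfu_op):
    """SFU contribution in GFlop/s for a given flops-per-op weighting."""
    return NUM_SMS * SFU_OPS_PER_CYCLE_PER_SM * flops_per_sfu_op * SHADER_CLOCK_GHZ


for flops_per_sfu_op in (1, 2, 4):
    total = FMA_PEAK_GFLOPS + sfu_gflops(flops_per_sfu_op)
    print(f"{flops_per_sfu_op} flops per SFU op -> total {total:.1f} GFlop/s")

# Weighting that would be needed to reach the reported total.
implied_flops_per_sfu_op = (REPORTED_TOTAL_GFLOPS - FMA_PEAK_GFLOPS) / sfu_gflops(1)
print(f"implied flops per SFU op: {implied_flops_per_sfu_op:.2f}")
```

Counting 2 flops per SFU op gives the 1159.2 GFlop/s computed above; the reported 1288 GFlop/s only comes out if each SFU op is weighted as 4 flops.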

Where has the value of 1288 GFlop/s (I assume you meant GFlop/s, not MFlop/s) been reported? I thought that only compute capability 1.x devices (and maybe 2.1 and higher), but not 2.0, were able to dual-issue instructions to the FPUs/“cores” and the SFUs.

Check the first column under GFlops/s. Other sources say the same…

Again, it all comes down to the cores and the SFUs being separate units: they can work in parallel, which I suppose should be possible with two warp schedulers. What remains is to construct a problem with the right balance and the right distribution of instructions to make use of it all.

But still, I can’t seem to arrive at 1288; I end up just a bit short of it.

The published number of 1030 single-precision GFLOPS for the C2050 looks correct to me. This basically assumes issuing nothing but a long stream of single-precision FMAs, and as far as I know under such conditions there is no issue bandwidth left for additional SFU instructions. Looking at the table in Wikipedia, I am unable to tell from which sources it derived the higher number of 1288 GFLOPS.

There is this paper which mentions a roofline model for the C2050 and does indeed take into account the SFU and FMA capabilities together, http://arxiv.org/pdf/1108.5815, page 4.

I find the roofline model to be a really useful way to capture the performance of kernels for different problems, and I want to situate my own program in its context. So I want to understand its details, this 1288 GFlop/s business being one of them.
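For reference, the basic roofline bound is just the minimum of the compute ceiling and memory bandwidth times arithmetic intensity, so the choice of peak only moves the flat part of the roof. A minimal sketch follows; the 144 GB/s memory bandwidth is an assumed nominal figure for the C2050 (not taken from this thread or from the paper), and the function name is made up for illustration.

```python
# ASSUMPTION: 144 GB/s nominal memory bandwidth for the C2050 (not from the thread).
BANDWIDTH_GB_PER_S = 144.0

# Compute ceilings discussed in this thread, in GFlop/s.
CEILINGS = {
    "FMA only": 1030.4,
    "FMA + SFU (reported)": 1288.0,
}


def roofline_gflops(intensity_flops_per_byte, peak_gflops,
                    bandwidth_gb_per_s=BANDWIDTH_GB_PER_S):
    """Attainable GFlop/s: the lower of the compute and memory-bandwidth limits."""
    return min(peak_gflops, bandwidth_gb_per_s * intensity_flops_per_byte)


for name, peak in CEILINGS.items():
    # Ridge point: the intensity above which a kernel is compute-bound.
    ridge = peak / BANDWIDTH_GB_PER_S
    print(f"{name}: ridge point at {ridge:.2f} flops/byte")
    for intensity in (0.5, 4.0, 16.0):
        bound = roofline_gflops(intensity, peak)
        print(f"  intensity {intensity:5.1f} flops/byte -> {bound:7.1f} GFlop/s")
```

Under these assumptions, a kernel with low arithmetic intensity sits on the bandwidth slope and is unaffected by which ceiling is used; the 1030 versus 1288 question only matters for kernels past the ridge point.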