Ladies & Gentlemen,

I’d like to discuss a topic that has been worrying me for quite some time. The last straw, which prompted this post, was the announcement of the GTX280, a GT200-based GPU (http://news.cnet.com/8301-13512_3-9969234-23.html), which is practically claimed to be a 1 TeraFLOP/s device.

A small detour:

When the 8800GTX was released, it was claimed to reach 518.4 GFLOP/s, theoretically of course. Yet the fastest number-crunching speed ever reached was about 250-300 GFLOP/s, roughly a factor of two short of that claim (for example, http://progrape.jp/cs/, multiply GFLOP/s numbers by 20.0/38).

The math behind the magic number, 518.4 GFLOP/s, is very simple:

128 cores operating @ 1.35 GHz, each capable of executing three flops per cycle.

Hence:

128 x 1.35 x 3 = 518.4 GFLOP/s

I do not know what kind of math operation allows three flops to be executed per cycle, let alone how to make the CUDA compiler use it.

What is true is that some of the math is compiled into floating-point multiply-add (FMAD) operations, which execute 2 flops per cycle. Hence, the peak observed performance *could* be:

128 x 1.35 x 2 = 345.6 GFLOP/s
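The arithmetic above can be reproduced in a few lines of Python. This is only a sketch: `peak_gflops` is a name made up for illustration, and the model does nothing more than multiply the core count, the clock, and an assumed number of flops per core per cycle.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s: cores x clock (GHz) x flops per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


# 8800GTX figures quoted in this post: 128 cores at 1.35 GHz.
advertised = peak_gflops(128, 1.35, 3)  # the claimed 3 flops per cycle
fmad_only = peak_gflops(128, 1.35, 2)   # every instruction is an FMAD (2 flops)

print(f"advertised peak: {advertised:.1f} GFLOP/s")
print(f"FMAD-only peak:  {fmad_only:.1f} GFLOP/s")
```

The only thing that changes between the two numbers is the flops-per-cycle factor, which is exactly the part of the claim in question.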

Yet realistic scientific applications use more than just FMAD, which means some operations will be FMAD and some will be ordinary arithmetic operations at 1 flop per cycle. Therefore, the expected performance is actually somewhere between

172.8 and 345.6 GFLOP/s,

and if FMAD and non-FMAD operations occur about equally often, the mean is about 260 GFLOP/s, assuming the application is compute bound. This is the number one should aim for, which is about 50% of theoretical peak performance; still quite good for a few hundred quid.
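To see how the estimate moves with the instruction mix, here is a small sketch. It rests on an assumption of mine, not on anything documented: each core issues one arithmetic instruction per cycle, an FMAD counts as 2 flops, and everything else counts as 1 flop. The name `expected_gflops` is hypothetical.

```python
def expected_gflops(cores, clock_ghz, fmad_fraction):
    """Expected GFLOP/s when a given fraction of issued instructions are FMADs.

    Assumption: one arithmetic instruction per core per cycle;
    an FMAD is 2 flops, any other instruction is 1 flop.
    """
    if not 0.0 <= fmad_fraction <= 1.0:
        raise ValueError("fmad_fraction must be between 0 and 1")
    flops_per_cycle = 2.0 * fmad_fraction + 1.0 * (1.0 - fmad_fraction)
    return cores * clock_ghz * flops_per_cycle


# 8800GTX: 128 cores at 1.35 GHz, for three instruction mixes.
for fraction in (0.0, 0.5, 1.0):
    gflops = expected_gflops(128, 1.35, fraction)
    print(f"FMAD fraction {fraction:.1f}: {gflops:.1f} GFLOP/s")
```

With a 50/50 mix this gives the midpoint of the range, which is where the figure of about 260 GFLOP/s comes from.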

Now back to the GTX280, a GT200-based GPU, if I am not mistaken. The claim is the usual one (http://news.cnet.com/8301-13512_3-9969234-23.html):

a device with 240 cores operating @ 1.296 GHz each, able to execute 3 flops per cycle per core. Hence, it has a theoretical peak performance of

240 x 1.296 x 3 = 933 GFLOP/s.

i.e. we are talking about 1 TeraFLOP/s on a desktop. Quite attractive, indeed.

I’m afraid, however, that the situation will be the same as with the 8800GTX: one probably will not be able to get these three flops per cycle, and will instead reach only somewhere between

311 and 622 GFLOP/s

The worst part is that double precision, which the scientific community badly wants, may deliver not the reported 90 GFLOP/s but only somewhere between

30 and 60 GFLOP/s
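Both GTX280 ranges follow from the same scaling, sketched below. The function names are made up for illustration, and the double-precision part is purely my guess: it assumes the reported 90 GFLOP/s figure counts 3 flops per cycle in the same way the single-precision figure does, which I have not been able to confirm.

```python
def flops_range(cores, clock_ghz):
    """(low, high) GFLOP/s for 1 and 2 flops per core per cycle."""
    base = cores * clock_ghz
    return base * 1, base * 2


def scale_claimed_peak(claimed_gflops, claimed_flops_per_cycle=3):
    """Scale an advertised peak down to 1 and 2 flops per cycle.

    Assumption: the advertised figure counts `claimed_flops_per_cycle`
    flops per cycle, as the single-precision figure does.
    """
    per_flop = claimed_gflops / claimed_flops_per_cycle
    return per_flop * 1, per_flop * 2


# GTX280 single precision: 240 cores at 1.296 GHz.
sp_low, sp_high = flops_range(240, 1.296)
# Double precision: scaled from the reported 90 GFLOP/s.
dp_low, dp_high = scale_claimed_peak(90.0)

print(f"single precision: {sp_low:.0f} to {sp_high:.0f} GFLOP/s")
print(f"double precision: {dp_low:.0f} to {dp_high:.0f} GFLOP/s")
```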

This might be compared to 2x Core2 Quad with double-precision SSE, which is incidentally easier to program than a GPU, unless I am missing something. In fact, I had the privilege of access to a GT200 GPU and ran a few codes with high arithmetic intensity. I was able to reach only 300+ GFLOP/s out of a theoretical peak of 720 GFLOP/s in single precision, and about 30 GFLOP/s in double precision. The double-precision results are not very encouraging, as they are about 10x lower than the single-precision ones.

In any case, I am just trying to understand low-level GPU programming, which could perhaps allow the device to be used more efficiently.

It is entirely possible that I am missing something or failing to understand properly. Therefore, any relevant comments, clarifications, or corrections are more than welcome.

Regards,

Evghenii