GTX280/GT200 GPU: can you really reach 1 TFLOP/s?

Ladies & Gentlemen,

I’d like to discuss a topic that has been worrying me for quite some time. The last straw, which led me to start this post, was the announcement of the GTX280 GT200 GPU (http://news.cnet.com/8301-13512_3-9969234-23.html), which is practically claimed to be a 1 TeraFLOP/s device.

A small detour:
When the 8800GTX was released, it was claimed to reach 518.4 GFLOP/s, theoretically of course. Yet the fastest number-crunching speed ever reached was about 250-300 GFLOP/s, which is nearly a factor of 2 short (for example, http://progrape.jp/cs/, multiply GFLOP/s numbers by 20.0/38).

The math behind the magic number, 518.4 GFLOP/s, is very simple:
128 cores operating @ 1.35 GHz, each capable of executing three flops per cycle.
Hence:
128 x 1.35 x 3 = 518.4 GFLOP/s

I do not know what kind of math operation allows three flops to be executed per cycle, nor how to make the CUDA compiler use it.

What is true is that some of the math is compiled into a floating-point multiply-add (FMAD) operation, which executes 2 flops per cycle. Hence, the peak observed performance could be:
128 x 1.35 x 2 = 345.6 GFLOP/s

Yet realistic science applications use more than just FMAD, which means some of the operations will be FMAD and some just ordinary arithmetic operations at 1 flop per cycle. Therefore, the expected performance is actually somewhere between

 172.8 and 345.6 GFLOP/s,

and if FMAD and non-FMAD operations occur about equally often, the mean is about 260 GFLOP/s, assuming the application is compute bound. This is the number one should aim for, which is about 50% of the theoretical peak performance; still quite good for a few hundred quid.
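
The estimate above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: the helper names are made up for illustration, and it assumes each core issues one arithmetic instruction per cycle, with a given fraction of those instructions being FMAD (2 flops) and the rest plain operations (1 flop).

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Peak in GFLOP/s: cores x clock (GHz) x flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


def mixed_gflops(cores, clock_ghz, fmad_fraction):
    """Compute-bound estimate for an instruction mix (assumed model).

    fmad_fraction is the share of issued instructions that are FMAD
    (2 flops); the remainder are counted as 1 flop each.
    """
    flops_per_cycle = 2.0 * fmad_fraction + 1.0 * (1.0 - fmad_fraction)
    return peak_gflops(cores, clock_ghz, flops_per_cycle)


# 8800GTX figures quoted in the post: 128 cores @ 1.35 GHz
CORES, CLOCK_GHZ = 128, 1.35

advertised = peak_gflops(CORES, CLOCK_GHZ, 3)    # the 3 flops/cycle claim
fmad_only = mixed_gflops(CORES, CLOCK_GHZ, 1.0)  # every instruction is FMAD
no_fmad = mixed_gflops(CORES, CLOCK_GHZ, 0.0)    # no FMAD at all
even_mix = mixed_gflops(CORES, CLOCK_GHZ, 0.5)   # half FMAD, half plain

for label, value in [("advertised", advertised), ("FMAD only", fmad_only),
                     ("no FMAD", no_fmad), ("50/50 mix", even_mix)]:
    share = value / advertised
    print(f"{label:>10}: {value:6.1f} GFLOP/s ({share:.0%} of advertised)")
```

Under this model a 50/50 mix gives 1.5 flops per cycle, which is exactly half of the advertised 3 flops per cycle.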

Now back to the GTX280, a GT200-based GPU, if I am not mistaken. The claim is the usual one (http://news.cnet.com/8301-13512_3-9969234-23.html):
a device with 240 cores operating @ 1.296 GHz each, and able to execute 3 flops per cycle per core. Hence, it has a theoretical peak performance of

240 x 1.296 x 3 = 933 GFLOP/s.

i.e. we are talking about 1 TeraFLOP/s in a desktop. Quite attractive, indeed.
I’m afraid, however, that the situation will be the same as with the 8800GTX: one probably will not be able to get these three flops per cycle, and instead will reach only somewhere between

311 and 622 GFLOP/s

The worst part is that double precision, which is badly wanted by the scientific community, instead of the reported 90 GFLOP/s, may be just between

30 and 60 GFLOP/s
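
For reference, the GTX280 ranges can be sketched in Python as well. The single precision numbers follow directly from the quoted specs. For double precision, the sketch assumes the 30-60 range comes from scaling the reported 90 GFLOP/s by the same 1/3 and 2/3 factors; that is an assumption about how the estimate was made, not a documented property of the hardware.

```python
# GTX280 figures quoted in the post
CORES = 240
CLOCK_GHZ = 1.296
REPORTED_DP_GFLOPS = 90.0  # reported double precision peak

# Single precision: cores x clock (GHz) x flops per cycle per core
sp = {flops: CORES * CLOCK_GHZ * flops for flops in (1, 2, 3)}

# Double precision: ASSUMPTION, scale the reported figure by 1/3 and 2/3,
# i.e. the same ratios as 1 and 2 flops per cycle versus the claimed 3.
dp_low = REPORTED_DP_GFLOPS * 1 / 3
dp_high = REPORTED_DP_GFLOPS * 2 / 3

for flops, gflops in sorted(sp.items()):
    print(f"single precision, {flops} flop(s)/cycle: {gflops:7.2f} GFLOP/s")
print(f"double precision (scaled): {dp_low:.0f} to {dp_high:.0f} GFLOP/s")
```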

This might be compared to 2x Core2 Quad with double precision SSE, which is incidentally easier to program than a GPU, unless I am missing something. In fact, I had the privilege of access to a GT200 GPU and ran a few codes with high arithmetic intensity. I was able to reach only 300+ GFLOP/s out of a theoretical peak of 720 GFLOP/s in single precision, and about 30 GFLOP/s in double precision mode. The results are not very encouraging for double precision, as it appears to be about 10x lower.

In any case, I am just trying to understand the low-level details of GPU programming, which could, perhaps, allow the device to be used more efficiently.

It is entirely possible that I am missing something or failing to understand properly. Therefore, any relevant comments, clarifications or corrections are more than welcome.

Regards,
Evghenii

OK, first off, I am no CUDA expert, far from it, I have never used it ^_^ But when a company lists the specs, they will always quote the best it can perform, for obvious reasons. They don't say “it sometimes only gets 300 gigaflops” because that doesn't sound good; instead they say “it CAN do 933 gigaflops”, and as you say, the math works out. You just need highly optimised code to achieve this.

I have a short question about your code. How many gigaflops were you getting with a G80 or G92 GPU?

With my code, I am reaching 225 GFLOP/s on an 8800Ultra and 225/1.5*1.8 ~ 270 GFLOP/s on an over-clocked 8800GTS(512) (1.78GHz per core instead of 1.625 GHz).
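
The scaling used here is a plain clock ratio. A small Python sketch of it follows; it assumes, as the estimate implicitly does, that both cards have the same number of cores and that the code is compute bound, so throughput scales linearly with the core clock. The function name is illustrative, and the 128-core count used for the peak is an assumption.

```python
def scale_by_clock(measured_gflops, measured_clock_ghz, target_clock_ghz):
    """Linear clock scaling of a measured throughput (compute-bound assumption)."""
    return measured_gflops * target_clock_ghz / measured_clock_ghz


# Numbers from the post: 225 GFLOP/s measured at ~1.5 GHz, target ~1.8 GHz
MEASURED_GFLOPS = 225.0
estimate = scale_by_clock(MEASURED_GFLOPS, 1.5, 1.8)
print(f"estimated at 1.8 GHz: {estimate:.0f} GFLOP/s")

# Fraction of the FMAD-only (2 flops/cycle) peak, ASSUMING 128 cores at 1.5 GHz
fmad_peak = 128 * 1.5 * 2
fraction = MEASURED_GFLOPS / fmad_peak
print(f"fraction of FMAD-only peak: {fraction:.0%}")
```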

Basically, to sum up my previous post: is there any way to reach the claimed theoretical peak performance in practice, and if there is, how can this be done? Or has anybody ever come within 10-20% of the peak performance?

Take a moment to read the AnandTech article: http://www.anandtech.com/video/showdoc.aspx?i=3334&p=1

It says that it should be possible to do a MAD & MUL at the same time, but that there was some trouble making this actually happen on G80/G92. On GT200 this should have been fixed.

Anyway, if you have some code that measures the achieved GFLOP/s, I can try it out for you on a pre-production version.

Thanks for the link, I found it quite interesting. It appears that if one could write code where FMAD & FMUL follow one after another, one should be able to reach 3 flops per cycle. It makes me curious, though, in how many cases this will be possible. I guess there will be reports very soon.

Thanks for the offer. I have already tested the code on a pre-production GT200 (with 1GHz per core, instead of 1.3GHz), and I found that in single precision mode it performs slightly faster than an overclocked 8800GTS(512), by about 10-20% (which is slightly larger than 240x1.0/(128x1.78), but this could be because the device has 2x the number of registers and there was no register spill to lmem). In double precision, unfortunately, I was able to reach only 25-30 GFLOP/s (the operation is too expensive in this mode). This has made me wonder whether FMAD+MUL is indeed working, but I could be missing something in my code.
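
The ratio in parentheses can be checked with a short Python sketch. It assumes throughput scales with cores x clock and ignores memory bandwidth, register pressure and everything else; the variable names are illustrative.

```python
# Pre-production GT200 as described in the post: 240 cores @ 1.0 GHz
gt200_cores, gt200_clock_ghz = 240, 1.0
# Over-clocked 8800GTS(512): 128 cores @ 1.78 GHz
gts_cores, gts_clock_ghz = 128, 1.78

# Expected gain if performance simply followed cores x clock
expected_speedup = (gt200_cores * gt200_clock_ghz) / (gts_cores * gts_clock_ghz)

# Peak of the 1 GHz part at the claimed 3 flops per cycle per core
gt200_peak = gt200_cores * gt200_clock_ghz * 3

print(f"expected speedup from cores x clock: {expected_speedup:.3f}")
print(f"pre-production peak at 3 flops/cycle: {gt200_peak:.0f} GFLOP/s")
```

This gives an expected gain of roughly 5%, against the 10-20% measured, and a 720 GFLOP/s peak for the 1 GHz part.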

E.

If I remember right, the x3 comes when you use everything on the card. As you already pointed out, with combined multiply-add you reach x2, but they added the texture operations on top of it too. When you fetch via texture you can do an (implicit) linear transformation on the index, and I think they include those operations in their peak performance (someone correct me if this is wrong).