GTX285 vs C1060 vs GTX480 GFLOP/s ?

So I've been working with a GTX285 and a C1060 for about a year.
For some reason I assumed the GTX285 had 933 GFLOPs and the C1060 the same, just with a slower mem clock.
I guess I never questioned it.

Today I was trying to work out the theoretical performance for a friend and I could not make sense of anything.

So my GTX285 is 240 cores at 1.48 GHz with 3 operations/cycle. That would be 1065.6 GFLOPs, is that right?
Then the C1060 is 240 cores at 1.3 GHz with 3 operations/cycle. That would be 936 GFLOPs, is that right?

So I concluded that those results are close to 933 GFLOPs… so I figured I'd double check with the GTX480, and this is where I completely lost track of the answer.

The GTX480 has 480 cores at 1.401 GHz with 3 operations/cycle. That would be 2017.44 GFLOPs… but when I looked at the specs I could only find something about 1350 GFLOPs…

What am I missing, or where is my reasoning wrong?

FLOP counting is a little confusing because of the dual-issue capabilities. Each CUDA core can complete one instruction (at least the basic ones) per clock cycle. This includes a single precision floating point multiply-add, which counts as two operations. In addition, all of the compute capability 1.x devices had the ability, in principle, to dual-issue a multiply instruction that was executed by another part of the multiprocessor. That's where the third operation comes from in the peak GFLOPS estimate.

In the original CUDA GPUs, there was a problem and the dual issue often did not happen even when there was a multiply instruction available. Later GPUs fixed this (not sure if it was G92 or GT200), but the dual issue multiply was still of limited use, except as a gimmick to inflate the peak GFLOPS numbers. In Fermi, it seems that they have removed it.

So to reproduce the NVIDIA calculation of peak GFLOPS, you multiply shader clock * CUDA cores * 3 for pre-Fermi and shader clock * CUDA cores * 2 for Fermi. For the GTX480 that gives 480 * 1.401 * 2 = 1344.96, which is the roughly 1350 GFLOPS figure you found in the specs. Personally, I would not get too caught up in the difference. I mostly compare GPUs looking just at clock * cores (i.e., instruction throughput) and memory bandwidth.
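
If it helps, here is a quick Python sketch of that rule of thumb, using the core counts and clocks quoted in the question. The names `GPUS`, `flops_per_cycle`, `peak_gflops` and `instruction_throughput` are made up for this post, and the `fermi` flag is just my shorthand for "no dual-issue multiply counted". Treat it as back-of-the-envelope arithmetic, not an official NVIDIA formula.

```python
# Core counts and shader clocks (GHz) as quoted in this thread.
# "fermi" is a hypothetical flag meaning the dual-issue MUL is not counted.
GPUS = {
    "GTX285": {"cores": 240, "clock_ghz": 1.48, "fermi": False},
    "C1060": {"cores": 240, "clock_ghz": 1.30, "fermi": False},
    "GTX480": {"cores": 480, "clock_ghz": 1.401, "fermi": True},
}


def flops_per_cycle(fermi):
    """FLOPs counted per core per clock in the marketing peak number."""
    # A multiply-add counts as 2 FLOPs; pre-Fermi parts add 1 more
    # for the (rarely useful) dual-issued multiply.
    return 2 if fermi else 3


def peak_gflops(cores, clock_ghz, fermi):
    """Peak single precision GFLOPS = clock * cores * FLOPs per cycle."""
    return cores * clock_ghz * flops_per_cycle(fermi)


def instruction_throughput(cores, clock_ghz):
    """Giga-instructions per second, i.e. just clock * cores."""
    return cores * clock_ghz


if __name__ == "__main__":
    for name, gpu in GPUS.items():
        gflops = peak_gflops(gpu["cores"], gpu["clock_ghz"], gpu["fermi"])
        ginstr = instruction_throughput(gpu["cores"], gpu["clock_ghz"])
        print(f"{name}: {gflops:.2f} peak GFLOPS, {ginstr:.2f} Ginstr/s")
```

Looking at the clock * cores column rather than the peak GFLOPS column is the comparison I would actually use, together with memory bandwidth.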