Question about computing GFLOPS: do fabs and a=-b instructions count?

Hello,

Is it correct to count 1 flop for each fabs instruction, and 1 flop for each a=-b instruction (a and b are floats)?

Thanks.

It depends, but usually no.

There is no good definition, but the most useful is one that’s a stable reference… for example the LINPACK benchmarks, which produce results measured in FLOPS and are directly comparable over different computers.

But for marketing, people get fuzzy because they just want a big number… so you pad it by saying things like “A texture read does interpolation, so let’s add extra FLOPS for the math we’d have to use if we did it with a core.” NV is guilty of this, and so is AMD, and Intel, and pretty much any hardware maker.

Note that a stable reference like LINPACK avoids this because all that matters is how fast you produce a known result, no matter how you do it.
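To make the "stable reference" idea concrete, here is a minimal sketch in Python. It is not LINPACK itself: it assumes a plain n x n matrix multiply as the known workload and charges it a fixed nominal count of 2*n^3 operations, so the reported rate depends only on how fast the result is produced. The names `nominal_gflops` and `naive_matmul` are made up for this illustration.

```python
import time

def nominal_gflops(n, seconds):
    """Rate for an n x n matrix multiply, using the fixed nominal count 2*n^3.

    The count is set by the problem size, not by the instructions actually
    executed, so fabs, negation, or texture tricks cannot inflate it.
    """
    return 2.0 * n**3 / seconds / 1e9

def naive_matmul(a, b):
    """Reference n x n multiply on lists of lists (the 'known result')."""
    cols = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

if __name__ == "__main__":
    n = 64
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = time.perf_counter() - start
    print(f"{nominal_gflops(n, elapsed):.4f} GFLOPS (nominal)")
```

Under this convention the answer to the original question is that fabs and a=-b contribute nothing unless the agreed operation count for the problem says they do.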

Even then, the problem is that each company tells different lies in their marketing literature. If the lies were standardised, it would be less of a problem :devil: :D

I find the GFLOPS numbers announced by vendors rather moot too. On the other hand, dealing with these numbers is a necessity, in order to set some kind of expectations when implementing an algorithm on the GPU, or, say, to compare results of CPU and GPU versions of some code.

Here, one can find the GFLOPS numbers Intel reports for its own processors. I find this table rather helpful, as it's hard to find exact data on clock rates and the number of SSE units for a given Intel processor. Still, these numbers should be used with care, as Intel seems to use a multiply-add instruction as the basis for these calculations (thus, they are counting 2 floating point operations per cycle), and it's hard to expect that all of one's code will compile to this type of instruction. So these numbers should usually be divided by 2, but then again they are reported for double-precision floating point operations, so multiplication by 2 is in order if one is actually using single-precision operations.

NVIDIA seems to be even worse than Intel in this respect. For example, the Tesla C1060, clocked at 1.3GHz, is reported here as capable of 933 GFLOPS. As it has 240 cores, obviously they use 3 floating point operations per cycle as the basis for the calculation, and as far as I know this is only possible for a texture lookup with interpolation; again, it is hard to believe one's code would consist only of that kind of operation. Thus I usually just divide the number NVIDIA reports for the given hardware by 3, so if I get my GPU code close to (GFLOPS-reported-by-NVIDIA/3)/GFLOPS-reported-by-Intel times faster than a known fast version of the CPU code, then I know I did a very good job in my CUDA coding. An additional issue in these comparisons is that it is usually hard to find good CPU codes to compare with (and if the CPU is not a target platform along with the GPU, it's hard to justify spending time on optimizing CPU codes just for the purpose of fair comparison), so oftentimes even we as developers end up lying. For example, over at CUDA Zone one can find a number of completely unrealistic speed-ups reported.
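As a rough sketch of the rule of thumb in this post, the expected speed-up can be written as a one-line function. The function name and the example inputs are only illustrative: 933 is the Tesla C1060 figure quoted above, and 19.2 is the Intel-reported figure for a P8600 Core 2 Duo that comes up later in the thread. The result is a yardstick for "very good" GPU code under this poster's assumptions, not a prediction.

```python
def expected_speedup(nvidia_reported_gflops, intel_reported_gflops, nvidia_divisor=3.0):
    """Rule-of-thumb GPU/CPU speed-up: (NVIDIA figure / 3) / Intel figure.

    nvidia_divisor=3 reflects the poster's habit of discounting the
    3-flops-per-cycle basis behind NVIDIA's headline number.
    """
    return (nvidia_reported_gflops / nvidia_divisor) / intel_reported_gflops

if __name__ == "__main__":
    # Figures taken from the thread; any other hardware pair works the same way.
    print(f"target speed-up: about {expected_speedup(933.0, 19.2):.1f}x")
```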

This is not how the GFLOPS are measured on GT200 devices. The factor of three comes from a MAD (multiply-add operation, which is two FLOPS), which is executed by each CUDA core, and a MULTIPLY operation, which is executed by the special function unit (SFU). So you will only see the peak performance if you have exactly this ratio of 1 MAD to 1 MUL (and many other conditions are met, like all variables being in registers, no shared memory use, etc.). The flops that can be obtained from texture linear interpolation actually count in addition to the 933 peak performance, but using these flops effectively in a scientific application is difficult, so they are not counted in the peak flops total. For comparing against other architectures, some advocate comparing only MAD rates (or FMAD rates in the case of Fermi), since this is what LINPACK boils down to doing. So for matrix multiplication the peak is 2/3 * 933 = 622 Gflops. There was a recent paper by Yifeng Chen ("Improving Performance of Matrix Multiplication and FFT on GPU") demonstrating that it is possible to obtain 620 out of these 622 Gflops in special circumstances, and he managed to improve on CUBLAS’s SGEMM performance, obtaining 400 Gflops on a Tesla C1060.
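The arithmetic behind these figures can be sketched as follows. One assumption is mine rather than this thread's: the C1060 shader clock is taken as 1.296 GHz (the thread rounds it to 1.3 GHz), which is what makes the totals come out at 933 and 622. The helper name `peak_gflops` is made up for this illustration.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x flops issued per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

CORES = 240
SHADER_CLOCK_GHZ = 1.296  # assumption: the thread quotes this as "1.3GHz"

mad_plus_mul = peak_gflops(CORES, SHADER_CLOCK_GHZ, 3)  # MAD (2 flops) + SFU MUL (1 flop)
mad_only = peak_gflops(CORES, SHADER_CLOCK_GHZ, 2)      # what matrix multiply can use
sfu_mul_only = peak_gflops(CORES, SHADER_CLOCK_GHZ, 1)  # the SFU multiply on its own

if __name__ == "__main__":
    print(f"MAD + MUL peak: {mad_plus_mul:.2f} GFLOPS")
    print(f"MAD-only peak:  {mad_only:.2f} GFLOPS")
    print(f"SFU MUL alone:  {sfu_mul_only:.2f} GFLOPS")
    # Fraction of the MAD-only peak reached by a 400 Gflops SGEMM
    print(f"400 Gflops SGEMM: {400.0 / mad_only:.0%} of MAD-only peak")
```

Measuring a kernel against the MAD-only figure rather than the headline number gives a fairer idea of how close it is to what the hardware can actually deliver for that kind of code.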

Interesting. Could you provide a link to the paper (“Improving Performance of Matrix Multiplication and FFT on GPU”)?

The paper was presented at this conference. I don’t think I can post the pdf here.

@p96159: Wait, I just checked (in the Real World Tech article), and while I was indeed wrong in saying that peak GFLOPS for GT200 can be achieved only if texture operations are involved, my understanding is still that the multipliers within the SFU can be used either for texture interpolation or for the multiply instruction. So how exactly would texture interpolation then add to the reported peak GFLOPS?

Good find. That’s the first time I’ve seen it mentioned that the SFU does the texture interpolation. So what does the texture unit do then, if it is the SFU that does the texture interpolation?

This raises other questions with me though:

1. The Gtexel rate changes depending upon whether one is processing 32-bit floats, 16-bit floats or 8-bit integers, if I recall. The MUL operation is a 32-bit float operation. Does this imply that it is possible to do FP16 or 8-bit integer multiply operations natively on the SFU faster than the FP32 MUL operation? It would seem that this functionality isn’t exposed at the CUDA level anyway.

2. How does one convert Gtexel rate into effective Gflops? If the SFU is used for the interpolation, presumably the SFU MUL rate of 311 Gflops should be somehow related to the Gtexel rate of 48.2 Gtexel/s (GTX 280)?

3. One of the restrictions preventing me from using texture interpolation in one of my applications is the requirement that the interpolant weights must be within [0,1]; I require negative interpolants. Since the SFU MUL operation is capable of full FP32 multiplication, this suggests to me that the hardware is capable of signed multiplication, so perhaps this functionality is present, but not exposed?

For Fermi, the MUL operation seems not to be present anymore, so the texture operations do count in addition to the raw flops count.

Further to my ponderings above, this has also got me wondering about where the integer -> floating point conversion is performed when reading textures of integers, setting the read mode to cudaReadModeNormalizedFloat? Is this conversion handled by the texture unit or the special function unit? What about the intrinsics for conversion between FP32 and FP16: do these use fixed function hardware, or the SFU to achieve conversion?

Actually the values in the table are based on a 128-bit (4 floats) SSE register operation per clock cycle per core. That’s 16 single-precision operations per clock cycle. My test cases obtain around half of that, so I find those numbers far more reasonable than NVIDIA’s 933 GFLOP/s.

I’ve just had confirmation from an NVIDIA engineer that the Real World Tech article is wrong (or at least it is not referring to texture interpolation; it could be vertex parameters). Texture interpolation is free, i.e., any flops you can gain through the texture unit are in addition to those available from the MAD and MUL operations. So my original statement is true.

No, the numbers at the link above seem to be based on double-precision operations. I asked exactly the same question on the Intel Software Forums recently, see here, and it seems the peak GFLOPS calculation for Core 2 processors, for single-precision operations, is as follows: 4 (number of single-precision operands in an SSE register) times 2 (each core has separate add and multiply units) times the number of cores times the clock frequency in GHz. So, for example, for my P8600 Core 2 Duo CPU, the peak GFLOPS number should be 4*2*2*2.4=38.4, while the number reported by Intel is 19.2, which is exactly half of 38.4; i.e. on this particular page they are reporting GFLOPS for double-precision operations (in which case the first multiplier is 2 instead of 4).
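The same calculation as a small sketch, using only the factors given in this post (SIMD width, separate add and multiply units, core count, clock). The function and parameter names are made up for illustration.

```python
def sse_peak_gflops(simd_width, fp_units, cores, clock_ghz):
    """Peak = operands per SSE register x FP units per core x cores x GHz."""
    return simd_width * fp_units * cores * clock_ghz

# P8600 Core 2 Duo from the post: 2 cores at 2.4 GHz,
# one add unit and one multiply unit per core.
single_precision = sse_peak_gflops(4, 2, 2, 2.4)  # 4 floats per 128-bit register
double_precision = sse_peak_gflops(2, 2, 2, 2.4)  # 2 doubles per 128-bit register

if __name__ == "__main__":
    print(f"single precision: {single_precision:.1f} GFLOPS")
    print(f"double precision: {double_precision:.1f} GFLOPS")
```

Note that the factor of 2 for the add and multiply units is only reached when adds and multiplies are issued in equal numbers, so code dominated by one of the two tops out at half of these figures.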

Ok, so that would mean that if you issue addps-mulps, addps-mulps continuously, you could achieve that performance. I didn’t realize that new processors have that ability. That’s good to know when writing SSE code. Thanks for the info.

Alex