Mythical Tflops

I decided to measure the execution time of simple commands: i *= 1 or i += 1.
I tried i float and integer.

Exemple code

__global__ void addKernel()
        float i1 = 1;float i2 = 1;float i3 = 1;float i4 = 1;float i5 = 1;
        float i6 = 1;float i7 = 1;float i8 = 1;float i9 = 1;float i10 = 1;

	i1 *= 1;  //string1
	i2 *= 1;  //string1
	i3 *= 1;  //string1
	i4 *= 1;  //string1
	i5 *= 1;  //string1
	i6 *= 1;  //string1
	i7 *= 1;  //string1
	i8 *= 1;  //string1
	i9 *= 1;  //string1
	i10 *= 1;  //string1
	i1 *= 1;  //string2
	i2 *= 1;  //string2
	i3 *= 1;  //string2
	i4 *= 1;  //string2
	i5 *= 1;  //string2
	i6 *= 1;  //string2
	i7 *= 1;  //string2
	i8 *= 1;  //string2
	i9 *= 1;  //string2
	i10 *= 1;  //string2
	i1 *= 1;  //string1000
	i2 *= 1;  //string1000
	i3 *= 1;  //string1000
	i4 *= 1;  //string1000
	i5 *= 1;  //string1000
	i6 *= 1;  //string1000
	i7 *= 1;  //string1000
	i8 *= 1;  //string1000
	i9 *= 1;  //string1000
	i10 *= 1;  //string1000

The CUDA assembler gives us ONLY one command for this operation.
For Float FMUL, FADD;
for integer IMUL, IADD

There are no cycles here, there is only one command executed strictly 10,000 times within a single thread using exclusively the CUDA register.
On my GTX 660, I have time to execute the command
5 tick IMUL

GPU Clock 1032 MHz = 1,032GHz
CUDA processors 980
1 operation - 3 tick
Thus, the maximum performance is
1,032*980/3 = 350 GFlop


FP32 (float) performance 1981 GFLOPS

But how to execute the command for 0.5 tick ???
It turns out that real peak performance is 6 times less than specified in the specification.

The basic building block of the GPU is the fused multiply-add (FMA) which counts as two FPops. Properly written test apps can typically get 90+% of the theoretical throughput.

How are you preventing your kernel body from being optimized away completely? It does absolutely nothing…

Show an example of such a code. I took the simplest basic commands +, *. for them only one simple command is used. Faster is already theoretically impossible. But this is only 3 tick, only - 1/6 the theoretical throughput.

If you want to look at a bit of open-source code, consider taking a look at the source of CUDA-Z. This is a very simplistic test app for sure, but it demonstrates the principle of such measurements. As I said, FLOPS measurements would be based on FMA instruction throughput, and you can see this in the CUDA-Z source code.

When I run this on my GPUs its measured FLOPS rate seems to be within 95% of the theoretical FLOPS. You can also download pre-compiled versions of CUDA-Z for various OS platforms. I tried two different executables for Windows (0.9.231 and 0.10.251), and they show slight differences in their respective measurements (e.g. 3420 single-precision GFLOPS vs 3398 single-precision GFLOPS) but either way they are very close to the theoretical rate of 3525 GFLOPS (1721 MHz x 1024 CUDA cores x 1 FMA/core/cycle) for the Quadro P2000 I have currently installed.

Thank. I downloaded the program and the source code of CUDA-Z. My VS 2013 gives an error.
In file a similar construction is used, the meaning of which I described above

/*!	\brief 16 MAD instructions for float point test.
#define CZ_CALC_FMAD_16(a, b) \
	a = b * a + b; b = a * b + a; a = b * a + b; b = a * b + a; \
	a = b * a + b; b = a * b + a; a = b * a + b; b = a * b + a; \
	a = b * a + b; b = a * b + a; a = b * a + b; b = a * b + a; \
	a = b * a + b; b = a * b + a; a = b * a + b; b = a * b + a; \

/*!	\brief 256 MAD instructions for float point test.
#define CZ_CALC_FMAD_256(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \
	CZ_CALC_FMAD_16(a, b) CZ_CALC_FMAD_16(a, b) \

I ran a similar code. But the CUDA compiler does 2 assembler commands a = a * a + a FMUL and FADD.
In total, these teams gave 4 tick. Or for one team 2 tick or 500Gflop. Anyway, I do not see the real speed 2000GFlop. Although in integer commands INT 32 CUDA-Z also gives 350GFlop that I wrote above.

I checked different versions of the CUDA-Z program. And looked at the source code The code is the same everywhere for Float with single precision.
On my GTX660

Single precesion float - 1742Gflop
32-bit Integer - 358 Gilop

CUDA-Z 0.6.163
Single precesion float - 1251Gflop
32-bit Integer - 358 Gilop

My program
Single precesion float - 500Gflop
32-bit Integer - 358 Gilop

Code C is the same everywhere, but the results are very different for “Single precesion float”

This would suggest that you are using a debug build rather than a fully optimized release build. By default, the CUDA compiler (nvcc) turns on full optimizations, and this includes the contraction of floating-point add and dependent floating-point multiply into a single FMA operation. You can code single-precision FMAs explicitly by using the standard function fmaf() in your code.

If I read the tables in Wikipedia correctly, a GTX 660 provides a throughput of 1881 single-precision GFLOPS in theory, and CUDA-Z 0.10.x reports 92.6% of that in real life.

I have no idea what is going with the really old version (0.6.x) of CUDA-Z cited. That version might have had a bug or used poor timing methodology (e.g. insufficient number of repetitions). As with all software, it would make sense to use something close to the latest version available.

BTW, the integer throughput numbers of CUDA-Z appear to be based (in analogy to the floating-point test) on integer multiply-add, which makes this a questionable metric in my opinion, as most integer-intensive apps will not be dominated by integer multiplies.

njuffa, thank you for the answers. Very helpful information.
I wondered for a long time whether NVIDIA had optimized its teams for different versions of the cards. We often have only the number of CUDA processors and the frequency.
Pay attention to 1080TI and 2080Ti performance.
GTX 660 1032MHz*980 core = 1011 GHz_c (one comand- tick)
S-pF 1742 ~(0.5 tick)
D-pF 90 ~(11 tick)
64-int 90 ~(11 tick)
32-int 358 ~(3 tick)

1080TI 1650MHz*3584 core = 5913 Ghz_c
S-pF 13327 ~(0,5 tick)
D-pF 438 ~(13 tick)
64-int 1000 ~(6 tick)
32-int 4566 ~(1,3 tick)

2080TI 1635MHz*4352 core = 7115 Ghz_c
S-pF 16258 ~(0,4 tick)
D-pF 530 ~(13 tick)
64-int 3100 ~(2,5 tick)
32-int 16209 ~(0,4 tick)

It turns out NVIDIA dramatically improved the speed of execution of 32-integer and 64-integer commands.
But the floating-point numbers were not down at all for many years. Thus, if the program actively used int-32 commands, then the speed increased many dozens of times from generation to generation. Integer mathematics should greatly increase.

GPUs are 32-bit machines. All 64-bit integer operations are emulated. As I stated earlier, CUDA-Z appears to use integer multiply-add to measure integer throughput, although integer multiply-add is not a common operation in most integer-intensive applications. On various GPU architectures even 32-bit integer-multiply add is emulated.

For reasons I have no insight into (possibly related to AI?) NVIDIA has seen it fit to improve integer multiply throughput in recent architectures, and this is what you see reflected in the integer performance numbers reported by CUDA-Z. The throughput of simple integer ALU instructions in GPUs has always tightly tracked single-precision throughput on all architectures since the beginning of CUDA.

Sometimes I see statements in the literature that GPUs are not well-suited for integer-intensive workloads. These statements are false and have been false since CUDA came into existence. There is a reason that applications like prime-number searches and crypto-currency mining achieve much higher performance on GPUs than on CPUs …

I wanted to upgrade to 1080TI. But 2080TI is still too expensive. I thought that 2080TI/1080TI there is only 15% difference. But for integer comands the difference is 4x? Is it really so? It will be necessary to try to check for normal IMUL, ADD, SUB, SHA etc. Yeah))

Sometimes I see statements in the literature that GPUs are not well-suited for integer-intensive workloads.

For prime numbers, an increase of hundreds and thousands of times. But now even more. These programs practically do not use memory. Work very well in parallel. That is why they work with commands at peak processor power, without unnecessary read / write delays

No. The difference for IMAD throughput is 4x. If you look at the generated machine code with cuobjdump --dump-sass, I think you will find that a 32-bit IMAD operation on Pascal is emulated via a four-instruction sequence, whereas it is a single instruction on Turing.

Your characterization of prime-number searches is on the mark. But even for memory-intensive applications GPUs often offer a 5x advantage over a CPU solution, at least if the working set is big. CPUs are the better choice when (1) low latency is important (2) working sets are small (3) parallelism is limited. As Amdahl’s Law tells us, an application that is partially parallel workload will eventually be bottlenecked on the serial portion. Thus the ideal combination is a high-frequency CPU with lots of cache per core (> 3.5 GHz base clock preferred) coupled with a fast GPU, such as your GTX 1080Ti.

32-bit integer multiply should improve significantly in Volta/Turing vs. Pascal or Maxwell. Volta and Turing provide dedicated 32-bit integer hardware for this. Evidence of this is available in table 2 of the programming guide: