I decided to measure the execution time of simple instructions such as i *= 1 and i += 1, trying i as both float and integer.
Example code:
__global__ void addKernel()
{
float i1 = 1; float i2 = 1; float i3 = 1; float i4 = 1; float i5 = 1;
float i6 = 1; float i7 = 1; float i8 = 1; float i9 = 1; float i10 = 1;
i1 *= 1; // line 1
i2 *= 1; // line 1
i3 *= 1; // line 1
i4 *= 1; // line 1
i5 *= 1; // line 1
i6 *= 1; // line 1
i7 *= 1; // line 1
i8 *= 1; // line 1
i9 *= 1; // line 1
i10 *= 1; // line 1
i1 *= 1; // line 2
i2 *= 1; // line 2
i3 *= 1; // line 2
i4 *= 1; // line 2
i5 *= 1; // line 2
i6 *= 1; // line 2
i7 *= 1; // line 2
i8 *= 1; // line 2
i9 *= 1; // line 2
i10 *= 1; // line 2
...
...
...
i1 *= 1; // line 1000
i2 *= 1; // line 1000
i3 *= 1; // line 1000
i4 *= 1; // line 1000
i5 *= 1; // line 1000
i6 *= 1; // line 1000
i7 *= 1; // line 1000
i8 *= 1; // line 1000
i9 *= 1; // line 1000
i10 *= 1; // line 1000
}
The CUDA disassembly shows exactly ONE instruction per such line:
for float, FMUL or FADD;
for integer, IMUL or IADD.
There are no loops here: it is a single instruction executed exactly 10,000 times within a single thread, using CUDA registers exclusively.
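For context, a minimal sketch of how such a kernel can be timed from the host with CUDA events (this is an illustrative harness, not the exact code I ran; the kernel body is abbreviated, and the sink pointer is added so the compiler cannot optimize the multiplies away):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abbreviated version of the kernel above; the real test unrolls
// the body by hand (10 variables x 1000 blocks = 10,000 multiplies)
// so that no loop overhead appears in the measurement.
__global__ void addKernel(float *sink)
{
    float i1 = 1;
    i1 *= 1; // ... repeated many times in the full version ...
    // Write the result out so the compiler keeps the work.
    *sink = i1;
}

int main()
{
    float *sink;
    cudaMalloc(&sink, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    addKernel<<<1, 1>>>(sink);   // warm-up launch
    cudaEventRecord(start);
    addKernel<<<1, 1>>>(sink);   // single thread, as in the test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaFree(sink);
    return 0;
}
```

Dividing the measured time by 10,000 and by the clock period gives the per-instruction figure in cycles.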
On my GTX 660 the measured execution times are:
3 cycles for FMUL, FADD, IADD;
5 cycles for IMUL.
GPU clock: 1032 MHz = 1.032 GHz
CUDA cores: 960
1 operation every 3 cycles
Thus the maximum performance comes out to
1.032 * 960 / 3 ≈ 330 GFLOPS
Yet the specification states:
FP32 (float) performance: 1981 GFLOPS
But how can an instruction execute in 0.5 cycles???
It turns out that the real peak performance is 6 times lower than stated in the specification.