I decided to measure the execution time of simple instructions: `i *= 1` and `i += 1`.

I tried `i` as both float and integer.

Example code:

```
__global__ void addKernel()
{
float i1 = 1; float i2 = 1; float i3 = 1; float i4 = 1; float i5 = 1;
float i6 = 1; float i7 = 1; float i8 = 1; float i9 = 1; float i10 = 1;
i1 *= 1; //line 1
i2 *= 1; //line 1
i3 *= 1; //line 1
i4 *= 1; //line 1
i5 *= 1; //line 1
i6 *= 1; //line 1
i7 *= 1; //line 1
i8 *= 1; //line 1
i9 *= 1; //line 1
i10 *= 1; //line 1
i1 *= 1; //line 2
i2 *= 1; //line 2
i3 *= 1; //line 2
i4 *= 1; //line 2
i5 *= 1; //line 2
i6 *= 1; //line 2
i7 *= 1; //line 2
i8 *= 1; //line 2
i9 *= 1; //line 2
i10 *= 1; //line 2
...
...
...
i1 *= 1; //line 1000
i2 *= 1; //line 1000
i3 *= 1; //line 1000
i4 *= 1; //line 1000
i5 *= 1; //line 1000
i6 *= 1; //line 1000
i7 *= 1; //line 1000
i8 *= 1; //line 1000
i9 *= 1; //line 1000
i10 *= 1; //line 1000
}
```

The CUDA assembler emits ONLY one instruction for each of these operations:

for float, FMUL and FADD;

for integer, IMUL and IADD.

There are no loops here: there is just one instruction, executed exactly 10,000 times within a single thread, using only CUDA registers.
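For reference, a kernel like this can be timed with CUDA events. A minimal sketch of such a harness (the single-thread launch matches the experiment described above; the warm-up launch and event plumbing are my assumptions, not part of the original post):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addKernel()
{
    // the 10,000 unrolled "iN *= 1;" lines from the example go here
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    addKernel<<<1, 1>>>();       // warm-up launch: single thread, as in the experiment
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    addKernel<<<1, 1>>>();       // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // ticks per instruction = elapsed_seconds * clock_Hz / 10000
    printf("kernel time: %f ms\n", ms);
    return 0;
}
```

Dividing the elapsed time by the 10,000 instructions (and multiplying by the clock rate) gives the per-instruction tick counts quoted below.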

On my GTX 660, the execution time per instruction is:

3 ticks for FMUL, FADD, IADD;

5 ticks for IMUL.

GPU clock: 1032 MHz = 1.032 GHz

CUDA processors: 980

1 operation = 3 ticks

Thus, the maximum performance is

1.032 * 980 / 3 ≈ 337 GFLOPS

The specification, however, says:

FP32 (float) performance: 1981 GFLOPS

But how can an instruction execute in 0.5 ticks???

It turns out that real peak performance is about 6 times lower than the specification states.