Please consider the attached program. It multiplies a provided input vector by a matrix and stores the result in an output vector. Each element of the output vector is computed by a separate thread. The calculation is repeated in a loop to reduce the influence of startup overhead.
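Since the attachment is not inlined here, the following is a hypothetical sketch of what the kernel described above might look like; the names `VI_SIZE`, `VO_SIZE`, and `matvec`, the row-major matrix layout, and the value 80 (chosen so that 2 * VI_SIZE = 160) are all assumptions, not taken from gpu_benchmark.cu:

```cuda
// Hypothetical reconstruction -- the actual gpu_benchmark.cu is not shown.
// VI_SIZE, VO_SIZE, and the row-major layout are assumptions.
#define VI_SIZE 80  // input vector length, so 2 * VI_SIZE = 160 flops per row
#define VO_SIZE 80  // output vector length = one thread per output element

__global__ void matvec(const float *mat, const float *vin, float *vout)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= VO_SIZE) return;

    float acc = 0.0f;
    for (int i = 0; i < VI_SIZE; ++i)           // one multiply and one add
        acc += mat[row * VI_SIZE + i] * vin[i]; // per element: 2*VI_SIZE flops
    vout[row] = acc;                            // single write to device memory
}
```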
On a Tesla C2050, the calculation takes 1.7 µs. This doesn’t seem unreasonable, because the values have to be read from and written to device memory.
However, if I uncomment the second inner loop, which repeats the calculation using only register and shared-memory variables, the time rises to 3.6 µs. I don’t understand this. That adds only about 2 * VI_SIZE = 160 floating-point operations per thread, and at 1.3 GHz surely they should not take more than (160 / 1.3e9) / 1e-6 ≈ 0.12 µs?
Can someone explain where I am going wrong?
gpu_benchmark.cu (2.81 KB)