Performance for primitive operations

Hello,

Please consider the attached program (sketched below). It multiplies a given input vector by a matrix and stores the result in an output vector. Each element of the output vector (i.e. each row of the matrix) is computed by a separate thread. The calculation is repeated in a loop to reduce the influence of startup overhead.

On a Tesla C2050, the calculation takes 1.7 us. This doesn’t seem unreasonable, because the values have to be read from and written to device memory.

However, if I uncomment the second inner loop to do the calculation a second time (using only register and shared memory variables), the runtime rises to 3.6us. I don’t understand this. There are only about 2 * VI_SIZE = 160 additional floating point operations per thread; at 1.3 GHz, surely they should not take more than 160 / 1.3e9 s ≈ 0.12 us?
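For reference, the kernel looks roughly like this. This is a simplified sketch, not the attached file verbatim: VI_SIZE = 80 is inferred from the 2 * VI_SIZE = 160 figure above, and names like vec_mat_mul, N_REPEAT and the exact indexing are placeholders.

#define VI_SIZE  80     // inferred from 2 * VI_SIZE = 160 above
#define N_REPEAT 1000   // placeholder repeat count to amortize startup overhead

__global__ void vec_mat_mul(const float *mat, const float *vin, float *vout)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one output element per thread

    __shared__ float vin_s[VI_SIZE];   // shared-memory copy of the input vector
    if (threadIdx.x < VI_SIZE)         // assumes blockDim.x >= VI_SIZE
        vin_s[threadIdx.x] = vin[threadIdx.x];
    __syncthreads();

    float vmul[VI_SIZE];               // per-thread array of partial products

    for (int rep = 0; rep < N_REPEAT; ++rep) {
        float sum = 0.0f;
        // first inner loop: one multiply-add per element, reading the
        // matrix row from device memory
        for (int j = 0; j < VI_SIZE; ++j) {
            vmul[j] = mat[row * VI_SIZE + j] * vin_s[j];
            sum += vmul[j];
        }
        // second inner loop (commented out in the timings above): another
        // 2 * VI_SIZE = 160 flops, touching only vmul, vin_s and sum
        //for (int j = 0; j < VI_SIZE; ++j)
        //    sum += vmul[j] * vin_s[j];
        vout[row] = sum;               // written to device memory every repetition
    }
}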

Can someone explain where I am wrong?

Thanks,
gpu_benchmark.cu (2.81 KB)

vmul is too large to be stored in registers, so it has to go into local memory, which is the same (relatively slow) off-chip memory as global memory. Unless the compiler completely unrolls all loops, vmul would have to live in local memory anyway, because registers cannot be indexed at runtime. The program is therefore completely memory bandwidth bound (160 bytes per thread is too much even for the cache to hold), and twice the number of memory operations in the inner loops gives you twice the runtime.
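You can confirm the spill by making ptxas verbose; it prints the per-thread register and local memory (lmem) usage of each kernel:

nvcc -arch=sm_20 -Xptxas -v gpu_benchmark.cu

And as a sketch of the alternative (based on the reconstructed kernel above, and assuming the partial products are not needed elsewhere): consuming each product immediately instead of storing the whole vmul array keeps everything in registers and removes the local memory traffic entirely.

float sum = 0.0f, sum2 = 0.0f;
for (int j = 0; j < VI_SIZE; ++j) {
    float p = mat[row * VI_SIZE + j] * vin_s[j];  // p lives in a register
    sum  += p;                 // the first loop's work
    sum2 += p * vin_s[j];      // the second loop's work, fused in
}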

If you run fewer than 600 threads per multiprocessor, setting the L1 cache size to 48 KB might just be enough to see the speedup you expect.
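On Fermi the L1/shared memory split is selected per kernel, e.g. (kernel name taken from the sketch above):

// prefer the 48 KB L1 / 16 KB shared memory split for this kernel
cudaFuncSetCacheConfig(vec_mat_mul, cudaFuncCachePreferL1);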