I have a scenario that I cannot make much sense of. It is all about a simple matrix multiplication sample and an additional loop. There are three scenarios:
1. A trivial matrix multiplication, using work groups to calculate sub-matrices of the result matrix (roughly the shape of the sketch after this list). I get about 140 GFlops for this scenario.
2. The same simple matrix multiplication code, but with a loop around it. The number of loop iterations is known at compile time of the kernel. For this code I again get about 140 GFlops.
3. The same as before, but the number of loop iterations is not known at compile time. Here I only get about 120 GFlops.
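To be concrete about scenario 1: the kernel has roughly the following shape. This is a simplified sketch, not my exact code; the tile size, the names, and the assumption that n is a multiple of the tile width are just for illustration (I'll write it as CUDA here for readability):

```cuda
#define TILE 16  // illustrative tile width, not my actual choice

// Each thread block (work group) computes one TILE x TILE sub-matrix
// of C = A * B. Matrices are n x n, row-major, n a multiple of TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk over the tiles of A and B that contribute to this output tile.
    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```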
In the third scenario I made sure that there are no additional global memory accesses; the only difference seems to be the loop itself. The behaviour is the same even if the loop is only executed once. The sketch below shows the difference between scenarios 2 and 3.
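Here the loop body is just a stand-in for the multiplication work, and the names are illustrative; the only thing that differs between the two kernels is whether the trip count is a compile-time constant or a kernel argument:

```cuda
// Scenario 2: trip count known at compile time. The compiler can fully
// unroll the loop, so no counter, compare, or branch instructions remain.
__global__ void repeat_known(float *data)
{
    const int iters = 4;              // compile-time constant
    float v = data[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * v + 1.0f;             // stand-in for the matmul work
    data[threadIdx.x] = v;
}

// Scenario 3: trip count only known at launch time. The loop bookkeeping
// (increment, compare, branch) survives into the generated code and costs
// instruction slots, even when the loop runs exactly once.
__global__ void repeat_unknown(float *data, int iters)
{
    float v = data[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        v = v * v + 1.0f;
    data[threadIdx.x] = v;
}
```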
Does anyone have an idea why I lose about 20 GFlops when adding a loop? Is this expected behaviour, or am I most likely doing something wrong?
I am using a GTX 280 with the CUDA 3.0 beta.