Missing about 20 GFlops

Hi,

I have a scenario that I cannot make much sense of. It is all about a simple matrix multiplication sample and an additional loop. There are three scenarios:

[list=1]

[*] A trivial matrix multiplication, using work groups to calculate sub-matrices of the result matrix. I get about 140 GFlops for this scenario.

[*] The same simple matrix multiplication code, but with a loop around it. The number of loop iterations is known when the kernel is compiled. For this code I again get about 140 GFlops.

[*] Same as before, but the number of loop iterations is not known at compile time. I only get about 120 GFlops.

[/list]

In the third scenario I made sure that there are no additional global memory accesses; the only difference seems to be the loop. The behaviour is the same even if the loop is executed only once.
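To make the difference concrete, the structure looks roughly like this (illustration only, not my actual kernel code; NUM_TILES and numTiles are placeholder names):

#define NUM_TILES 16            // scenario 2: bound fixed when the kernel is compiled

void scenario2(void)
{
	for (int t = 0; t < NUM_TILES; ++t) {
		// load sub-matrices, multiply-accumulate, synchronize, ...
	}
}

void scenario3(int numTiles)    // scenario 3: bound only known at runtime
{
	for (int t = 0; t < numTiles; ++t) {
		// identical body
	}
}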

Does anyone have an idea why I lose about 20 GFlops when adding a loop? Is this expected behaviour, or am I most likely doing something wrong?

I am using a GTX 280 with the CUDA 3.0 beta.

-Jens

The reason is quite simple. The compiler can unroll a loop if it knows how many iterations it will do. Unrolling the loop decreases overhead, and thus makes the code run faster. However, if the compiler can’t tell the number of iterations, unrolling is impossible without introducing incorrect behavior.

for (int n = 0; n < 400; n++)
{
	DoSomething();
}

becomes something like

for (int n = 0; n < 400; n += 4)
{
	DoSomething();
	DoSomething();
	DoSomething();
	DoSomething();
}

This removes three quarters of the loop-condition checks.

However,

for (int n = 0; n < j; n += 4)
{
	DoSomething();
	DoSomething();
	DoSomething();
	DoSomething();
}

might not work correctly - suppose j isn’t divisible by 4…
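One standard workaround (just a sketch, assuming j is non-negative) is to unroll the bulk of the iterations by hand and finish the leftovers in a plain loop:

int n = 0;

// Unrolled part: runs while at least 4 iterations remain.
for (; n + 3 < j; n += 4)
{
	DoSomething();
	DoSomething();
	DoSomething();
	DoSomething();
}

// Remainder loop: covers the 0 to 3 iterations left when j isn't divisible by 4.
for (; n < j; n++)
{
	DoSomething();
}

The compiler can perform the same transformation automatically, but only when it knows the trip count (or can prove enough about it) at compile time.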

Thanks, this explains scenario 2. But I am still wondering whether adding a simple loop should decrease performance by 20 GFlops. I do understand that a loop imposes some overhead, but 20 GFlops for a loop with only one iteration seems like a lot.

If you do only a single operation in the loop body, you can lose much more than 20 GFlops. Consider

for (int i = 0; i < n; i++)
	a += i;

You’re doing a single addition within the loop. But the loop code does another addition (i++), a branch check (i<n) and a jump instruction (back to the start of the loop if i<n). Compile your code with -keep to get the resulting .ptx and inspect how many additional instructions the loop causes.
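If the trip count really is only known at runtime, it can still be worth hinting the compiler. With nvcc you can request partial unrolling via #pragma unroll with a factor, and the compiler generates the remainder handling itself; I'm not sure how faithfully the current OpenCL compiler honours the same pragma, so check the resulting .ptx:

// Ask the compiler to unroll by 4 even though n is only known at runtime;
// the leftover iterations are handled automatically.
#pragma unroll 4
for (int i = 0; i < n; i++)
	a += i;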

I am actually doing quite a lot of work inside the loop. I have reimplemented the kernel in CUDA, and there the performance difference between scenarios 2 and 3 is only 4 GFlops, so I guess the code generated from the OpenCL kernel is sub-optimal.

Yeah, the OpenCL compiler is certainly far from ideal currently. I wish they would release an updated version, but I suspect we’ll have to wait for Fermi to be released for that.