 # Loop inside kernel or over kernels in host code? [performance question]

I have an existing algorithm that basically consists of 5 nested for loops, where the innermost loop does a single-line computation (a complex-valued multiplication plus an addition). The original (non-CUDA) code looks something like this:

```cpp
// original computation (= host version):
extern "C" void ComputationOnHost(cuComplex **_X_h, cuComplex ****_Y_h, cuComplex ***_Z_h)
{
    for (int a = 0; a < A; a++)                 // A typically = 1
    {
        for (int b = 0; b < B; b++)             // B typically = 4
        {
            for (int c = 0; c < C; c++)         // C typically = 2
            {
                for (int d = 0; d < D; d++)     // D is const = 124
                {
                    for (int e = 0; e < E; e++) // E is const = 493
                    {
                        // complex multiply-accumulate (cuCaddf/cuCmulf from cuComplex.h,
                        // since plain C has no operators for cuComplex):
                        _X_h[c][e + d*8] = cuCaddf(_X_h[c][e + d*8],
                                                   cuCmulf(_Y_h[a][c][e][b], _Z_h[a][8*d][b]));
                    }
                }
            }
        }
    }
}
```

In order to do some of the computation on the GPU, my next step was to replace the inner loop (running from 0…492) with GPU threads. One more note: since the original code uses multidimensional arrays (see above: one 2D, one 3D and one 4D array), I translated these into one-dimensional arrays on the GPU and compute the corresponding index manually:

```cpp
// accelerated version (= kernel version):
__global__ void ComputationOnDevice(cuComplex *_X_d, cuComplex *_Y_d, cuComplex *_Z_d)
{
    int idxL = blockIdx.x*blockDim.x + threadIdx.x; // translates to an index between 0..492 (or 511 to be precise)

    for (int a = 0; a < A; a++)             // A typically = 1
    {
        for (int b = 0; b < B; b++)         // B typically = 4
        {
            for (int c = 0; c < C; c++)     // C typically = 2
            {
                for (int d = 0; d < D; d++) // D is const = 124
                {
                    if (idxL < E)
                    {
                        // practically the same computation as in the host code, except that
                        // the loop variable "e" is replaced by the thread index "idxL":
                        _X_d[c*(E + D*8) + (idxL + d*8)] =
                            cuCaddf(_X_d[c*(E + D*8) + (idxL + d*8)],
                                    cuCmulf(_Y_d[a*C*E*B + c*E*B + idxL*B + b],
                                            _Z_d[a*(8*D)*B + (8*d)*B + b]));
                    }
                }
            }
        }
    }
}
```

Note: due to interdependencies between the loops, I couldn’t really parallelize any of the other loops (e.g. by using more blocks). That is because _X_d is written to the same memory location more than once (for d = 0, 1, 2, …), so the order of execution matters.

Anyway, my real question is whether it is more efficient to keep the 4 outer loops (a, b, c, d) inside the kernel (and thus have only a single kernel call), or to move them into host code and do multiple kernel calls instead?

Here’s the corresponding host code for the two versions:

```cpp
...
int blockSize = 256;
int nBlocks = 2;    // that's the maximum # of blocks I can use with this kernel code
                    // (more blocks would lead to wrong results, due to lack of synchronization)

// single kernel call (with all the loops inside the kernel):
ComputationOnDevice <<< nBlocks, blockSize >>> (X_d, Y_d, Z_d);
...
```

Alternatively, I moved the loops out into host code. Here I expected a speedup, since I can now launch 16 blocks in parallel (before, the maximum was 2 blocks because of the data interdependencies), as all threads are implicitly synchronized between kernel calls:

```cpp
...
int blockSize = 32;
int nBlocks = 16;   // now I can launch more blocks, since kernel calls are synched in between

// multiple kernel calls (the kernel now takes the loop indices as parameters):
for (int a = 0; a < A; a++)             // A typically = 1
{
    for (int b = 0; b < B; b++)         // B typically = 4
    {
        for (int c = 0; c < C; c++)     // C typically = 2
        {
            for (int d = 0; d < D; d++) // D is const = 124
            {
                ComputationOnDevice <<< nBlocks, blockSize >>> (X_d, Y_d, Z_d, a, b, c, d);
            }
        }
    }
}
...
```

Either way, in practice I see a significant slowdown with the last version (loops in host code), although I expected the opposite. Does anybody have an explanation for this, or other recommendations on how to speed up this code?

Thanks a lot

(especially if you have read this far ;-)

Michael

The GPU can only execute one kernel at a time, so many small kernel launches aren’t good.

It should be possible for you to have no loops at all. This could be achieved either by using atomic operations or by collecting the results in a larger array (in shared or global memory, depending on sizes) and then summing them. With the for loops the problem isn’t really parallel enough.

Take a look at the reduction examples. As said above, you could compute the products, store them in shared memory, and add them together through a reduction.

Tigga, thanks for the reply. I am fully aware that only one kernel can be executed at a time. Still, I don’t understand why looping over multiple kernel calls is so much slower than doing the for loops inside a single kernel (where I wait for a sync(!) after each computation… so it should be as slow as the multiple-kernel version).

Well, normally I would agree that I should get rid of all the for loops. The problem I’m facing, however, is that I can only parallelize the inner loop (that is, 493 iterations), as the outer loop depends on the results of that inner one.

Or were you just deterred by my long post? ;-)
In that case, it would be enough if you just focused on the last paragraph of my initial post (because all it comes down to is the question in the title).

thx,
Michael

I really do not see why.

The only problem is _X_h[c][e+d*8] += …

e = 0, d = 1 gets added to the same location as e = 8, d = 0, etc. But you can split that out, and use a reduction to add those together afterwards.

Well, with “the outer loop depends on results of the inner one” I meant that the computation reads from a memory location that was written before by another thread - so the order of execution matters.

Simplified what it does is:

X = X + Y * Z

Yes, the Y * Z could be computed in parallel; the only problem I had was with X. I don’t know yet what a “reduction” exactly does or how it could help me here, but I’ll take a look at that example. Sounds good!

btw: since Y and Z are never modified, it would make sense to put them in constant or texture memory, right? (Which one is better? Any differences?)

Thanks,

Michael

So you can give your X an extra dimension (over c, d & e) and afterwards add up the d and e dimensions in a (somewhat complicated) way to get your 2D result. A reduction is a way to reduce a vector (or array) to a single value - that might be the sum, min or max of the vector.

Sorry - I misread something. One reason one is faster than the other is the different threads/blocks configuration. 256 threads in 2 blocks will (as I understand it) only use two SMs - the rest will sit idle. 32 threads in 16 blocks will spread across the card a lot better.

Either way, you’re still at the stage where doubling the number of threads is going to have an insignificant effect on the speed of the kernel. If you’re not convinced - try graphing it. It may hit home just how parallel GPUs are.

RE: Reduction

What you’re doing per x element is:

x = x(initial) + sum(y(i) * z(i))

A more parallel way of doing this would be to calculate y(i) * z(i) for all i, sum (i.e. reduce) them, and then add the result to the initial value of x.

Exactly - but note: the 32-threads/16-blocks configuration is actually the one for my second (multiple-kernel-call) version, which turned out to be much slower! That’s why I’m confused - it should be the other way round!!

@Tigga/Denis:

Thanks for the hints. I’ll take a look into Reduction asap.