I have an existing algorithm that basically consists of five nested for loops, where the innermost loop does a single-line computation (a complex-valued multiplication plus an addition). The original (non-CUDA) code looks something like this:
// original computation (= host version):
extern "C" void ComputationOnHost(cuComplex **_X_h, cuComplex ****_Y_h, cuComplex ***_Z_h)
{
for (int a = 0; a < A; a++) // A typically = 1
{
for (int b = 0; b < B; b++) // B typically = 4
{
for (int c = 0; c < C; c++) // C typically = 2
{
for (int d = 0; d < D; d++) // D is const = 124
{
for (int e = 0; e < E; e++) // E is const = 493
{
_X_h[c][e + d*8] = cuCaddf(_X_h[c][e + d*8], cuCmulf(_Y_h[a][c][e][b], _Z_h[a][8*d][b]));
}
}
}
}
}
}
}
To move some of the computation onto the GPU, my next step was to replace the innermost loop (running from 0 to 492) with GPU threads. One more note: since the original code uses multidimensional arrays (see above: one 2D, one 3D and one 4D array), I flattened them to one-dimensional arrays on the GPU and compute the corresponding index manually:
// accelerated version (= kernel version):
__global__ void ComputationOnDevice(cuComplex *_X_d, cuComplex *_Y_d, cuComplex *_Z_d)
{
int idxL = blockIdx.x*blockDim.x + threadIdx.x; // global thread index, 0..511 here (2 blocks x 256 threads); only threads with idxL < E = 493 do useful work
for (int a = 0; a < A; a++) // A typically = 1
{
for (int b = 0; b < B; b++) // B typically = 4
{
for (int c = 0; c < C; c++) // C typically = 2
{
for (int d = 0; d < D; d++) // D is const = 124
{
if (idxL < E)
// practically the same computation as in the host code, except that the loop variable "e" is replaced by my thread index "idxL":
_X_d[c*(E+D*8) + (idxL+d*8)] = cuCaddf(_X_d[c*(E+D*8) + (idxL+d*8)], cuCmulf(_Y_d[a*C*E*B + c*E*B + idxL*B + b], _Z_d[a*(8*D)*B + (8*d)*B + b]));
__syncthreads();
}
}
}
}
}
Note: due to data dependencies between the loops, I couldn't really parallelize any of the other loops (e.g. by using more blocks): _X_d is written to the same memory location more than once (for d = 0, 1, 2, …), so the order of execution matters.
Anyway, my real question is whether it is more efficient to keep the four outer loops (a, b, c, d) inside the kernel (and thus make only a single kernel call), or to move them into host code and make multiple kernel calls.
Here is the corresponding host code for the two versions:
...
int blockSize = 256;
int nBlocks = 2; // that's the maximum # of blocks I can use with this kernel code (more blocks would lead to wrong results, due to lack of synchronization)
// single kernel call (with all the loops inside kernel):
ComputationOnDevice <<< nBlocks, blockSize >>>(X_d, Y_d, Z_d);
...
Alternatively, I moved the loops into host code. Here I expected a speedup, since I can now launch 16 blocks in parallel (before, the maximum was 2 blocks, because of the data dependencies), as all threads are implicitly synchronized between kernel calls:
...
int blockSize = 32;
int nBlocks = 16; // now I could launch more blocks, since kernel calls are synched in between
// multiple kernel calls:
for (int a = 0; a < A; a++) // A typically = 1
{
for (int b = 0; b < B; b++) // B typically = 4
{
for (int c = 0; c < C; c++) // C typically = 2
{
for (int d = 0; d < D; d++) // D is const = 124
{
// kernel signature extended accordingly to take a, b, c, d as parameters:
ComputationOnDevice <<< nBlocks, blockSize >>>(X_d, Y_d, Z_d, a, b, c, d);
}
}
}
}
...
Either way, in practice I see a significant slowdown with the last version (loops in host code), although I expected the opposite. Does anybody have an explanation for why that is, or other recommendations on how to speed up this code?
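(In case someone wants to reproduce the comparison: a minimal timing sketch with CUDA events, error checking omitted, could look like this, wrapped either around the single kernel call or around the whole host-side loop nest:)

```cpp
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... the single kernel call, or the whole host-side loop nest ...
ComputationOnDevice <<< nBlocks, blockSize >>>(X_d, Y_d, Z_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop); // wait until all recorded GPU work has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```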
Thanks a lot
(especially if you have read this far ;-)
Michael