Loop inside kernel or over kernels in host code? [performance question]

I have an existing algorithm that basically consists of 5 nested for loops, where the innermost loop does a single-line computation (a complex-valued multiplication plus an addition). The original (non-CUDA) code looks something like this:

// original computation (= host version):

extern "C" void ComputationOnHost(cuComplex **_X_h, cuComplex ****_Y_h, cuComplex ***_Z_h)
{
	for (int a = 0; a < A; a++)                     // A typically = 1
	{
		for (int b = 0; b < B; b++)                 // B typically = 4
		{
			for (int c = 0; c < C; c++)             // C typically = 2
			{
				for (int d = 0; d < D; d++)         // D is const = 124
				{
					for (int e = 0; e < E; e++)     // E is const = 493
					{
						// complex multiply-accumulate
						_X_h[c][e + d*8] += _Y_h[a][c][e][b] * _Z_h[a][8*d][b];
					}
				}
			}
		}
	}
}

To move some of the computation to the GPU, my next step was to replace the inner loop (running from 0…492) with GPU threads. One more note: since the original code uses multidimensional arrays (see above: one 2D, one 3D and one 4D array), I flattened them into one-dimensional arrays on the GPU and compute the corresponding index manually:
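To make the flattening explicit, the index mapping I have in mind looks roughly like the sketch below (the helper names idxX/idxY/idxZ are made up for illustration and don't appear in my real code; it assumes A…E are compile-time constants and row-major layouts matching the flat indices in the kernel below):

// sketch only: row-major index helpers matching the flat indices in the kernel below
// assumed layouts: X is [C][E + D*8], Y is [A][C][E][B], Z is [A][8*D][B]
__host__ __device__ inline int idxX(int c, int col)             { return c*(E + D*8) + col; }
__host__ __device__ inline int idxY(int a, int c, int e, int b) { return ((a*C + c)*E + e)*B + b; }
__host__ __device__ inline int idxZ(int a, int row, int b)      { return (a*(8*D) + row)*B + b; }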

// accelerated version (= kernel version):

__global__ void ComputationOnDevice(cuComplex *_X_d, cuComplex *_Y_d, cuComplex *_Z_d)
{
	int idxL = blockIdx.x*blockDim.x + threadIdx.x;	// translates to an index between 0..492 (or 511 to be precise)

	for (int a = 0; a < A; a++)                     // A typically = 1
	{
		for (int b = 0; b < B; b++)                 // B typically = 4
		{
			for (int c = 0; c < C; c++)             // C typically = 2
			{
				for (int d = 0; d < D; d++)         // D is const = 124
				{
					// practically the same computation as in the host code, except that
					// the loop variable "e" is replaced by the thread index "idxL":
					if (idxL < E)
						_X_d[c*(E + D*8) + (idxL + d*8)] =
							cuCaddf(_X_d[c*(E + D*8) + (idxL + d*8)],
							        cuCmulf(_Y_d[a*C*E*B + c*E*B + idxL*B + b],
							                _Z_d[a*(8*D)*B + (8*d)*B + b]));

					__syncthreads();
				}
			}
		}
	}
}

Note: due to interdependencies between the loops, I couldn't really parallelize any of the other loops (e.g. by using more blocks): _X_d is written to the same memory location more than once (for d = 0, 1, 2, …), so the order of execution matters.

Anyway, my real question is whether it is more efficient to keep the 4 outer loops (a, b, c, d) inside the kernel (and thus have only a single kernel call), or whether I should move them into host code and do multiple kernel calls instead.

Here is the corresponding host code for the two versions:

...

int blockSize = 256;

int nBlocks = 2;	// that's the maximum # of blocks I can use with this kernel code (more blocks would lead to wrong results, due to lack of synchronization)

// single kernel call (with all the loops inside kernel):

ComputationOnDevice <<< nBlocks, blockSize >>>(X_d, Y_d, Z_d);

...

Alternatively, I moved the loops into host code. Here I expected a speedup, since I can launch 16 blocks in parallel (before, the maximum number of blocks was 2 because of the data interdependencies), as all threads are implicitly synchronized between kernel calls:

...

int blockSize = 32;

int nBlocks = 16;	// now I can launch more blocks, since kernel calls are synchronized in between

// multiple kernel calls (the kernel is adapted to take the current loop indices as parameters):

for (int a = 0; a < A; a++)                     // A typically = 1
{
	for (int b = 0; b < B; b++)                 // B typically = 4
	{
		for (int c = 0; c < C; c++)             // C typically = 2
		{
			for (int d = 0; d < D; d++)         // D is const = 124
			{
				ComputationOnDevice <<< nBlocks, blockSize >>>(X_d, Y_d, Z_d, a, b, c, d);
			}
		}
	}
}

...

Either way, in practice I see a significant slowdown with the last version (loops in host code), although I expected the opposite. Does anybody have an explanation for why that is, or other recommendations on how to speed up this code?

Thanks a lot

(especially if you have read this far ;-)

Michael

The GPU can only execute one kernel at a time, and each launch carries some overhead, so many small kernel launches aren't good.

It should be possible for you to have no loops at all. This could be achieved either by using atomic operations or by collecting the results in a larger array (in shared or global memory, depending on the sizes) and then summing them. With the for loops, the problem isn't really parallel enough.

Take a look at the reduction examples. As said above, you could compute the products, store them in shared memory, and add them together through a reduction.
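Just to illustrate the idea, a minimal (untested) sketch could look like the code below. It assumes a simplified flat layout where output element x[j] is the sum of n products y[j*n + i] * z[j*n + i], one block per output element, blockDim.x being a power of two, and the shared-memory size being passed at launch:

// sketch only: one block computes one output element x[j]
__global__ void MulAndReduce(cuComplex *x, const cuComplex *y, const cuComplex *z, int n)
{
	extern __shared__ cuComplex partial[];	// blockDim.x entries, size given at launch

	int j = blockIdx.x;		// which output element
	int i = threadIdx.x;	// which product within that element

	// each thread computes one product (threads beyond n contribute zero)
	partial[i] = (i < n) ? cuCmulf(y[j*n + i], z[j*n + i]) : make_cuComplex(0.0f, 0.0f);
	__syncthreads();

	// tree reduction in shared memory (requires blockDim.x to be a power of two)
	for (int s = blockDim.x / 2; s > 0; s >>= 1)
	{
		if (i < s)
			partial[i] = cuCaddf(partial[i], partial[i + s]);
		__syncthreads();
	}

	// thread 0 accumulates the reduced sum into the output element
	if (i == 0)
		x[j] = cuCaddf(x[j], partial[0]);
}

A launch for E = 493 products per element would then look something like MulAndReduce<<<numElements, 512, 512 * sizeof(cuComplex)>>>(x_d, y_d, z_d, 493). Your real indexing with the overlapping e + d*8 windows would still have to be mapped onto this, e.g. by first collecting the products in a larger array as described above.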

Tigga, thanks for the reply. I am fully aware that only one kernel can be executed at a time. Still, I don't understand why looping over multiple kernel launches is so much slower than doing the for loops inside a single kernel (where I wait for a sync(!) after each computation, so it should be roughly as slow as the multiple-kernel version).

Normally I would agree that I should get rid of all the for loops. The problem I am facing, however, is that I can only parallelize the inner loop (that is, 493 iterations), as the outer loops depend on results of the inner one.

Any other comments/ideas?
Or are you just deterred by my long post? ;-)
In that case, it would be enough if you just focused on the last paragraph of my initial post (because it all comes down to the question in the title).

thx,
Michael

I really do not see why.

The only problem is _X_h[c][e+d*8] += …

The term for e = 0, d = 1 gets added into the same element as the one for e = 8, d = 0, etc. But you can split that out and use a reduction to add those contributions together afterwards.

With “as the outer loop depends on results of that inner one” I meant that the computation reads from a memory location that was written before by another thread, so the order of execution matters.

Simplified, what it does is:

X = X + Y * Z

Yes, the Y * Z part could be computed in parallel; the only problem I had was with X. I don't know yet what a "reduction" does exactly or how it could help me here, but I'll take a look at that example. Sounds good!

By the way: since Y and Z are not modified, it would make sense to put them in constant or texture memory, right? (Which one is better? Any differences?)
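(Just so it's clear what I mean by constant memory, a sketch - the symbol names are placeholders, Y_h_flat/Z_h_flat stand for flattened host copies, and __constant__ memory is limited to 64 KB in total, so this only works if the data actually fits:)

// sketch only: read-only inputs placed in constant memory (total __constant__ space is 64 KB)
__constant__ cuComplex Y_c[A * C * E * B];
__constant__ cuComplex Z_c[A * 8 * D * B];

// host side: copy the (unchanging) inputs once, before the kernel launches
cudaMemcpyToSymbol(Y_c, Y_h_flat, sizeof(Y_c));
cudaMemcpyToSymbol(Z_c, Z_h_flat, sizeof(Z_c));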

Thanks,

Michael

I read X += Y*Z

So you can give your X an extra dimension (over c, d & e) and afterwards add up the d and e dimensions (in a somewhat complicated way, because of the e + d*8 overlap) to get your 2D result. A reduction is a way to reduce a vector (or array) to a single value; that might be the sum, min or max of the vector.

Sorry - I misread something. One reason one version is faster than the other is the different thread/block configuration: 256 threads in 2 blocks will (as I understand it) only use two SMs, and the rest will sit idle, whereas 32 threads in 16 blocks will spread across the card a lot better.

Either way, you're still at the stage where doubling the number of threads is going to have an insignificant effect on the kernel's runtime. If you're not convinced, try graphing it; it may hit home just how parallel GPUs are.

RE: Reduction

What you’re doing per x element is:

x = x(initial) + sum(y(i) * z(i))

A more parallel way of doing this would be to calculate y(i) * z(i) for all i, sum (i.e. reduce) them, and then add the result to the initial value of x.

Exactly - but note: the 32-thread/16-block configuration is actually the one from my second version (multiple kernel calls), which turned out to be much slower! That's why I'm confused; it should be the other way round!!

@Tigga/Denis:

Thanks for the hints. I’ll take a look into Reduction asap.