Parallelism

OK, I think this is impossible, but here goes.

I have the following code

for (x = 0; x < 8; x++)
{
    ...
    for (z = 0; z < 1000; z++)
    {
        ...
    }
    ...
}

Is it possible for me to parallelise both the inner and the outer loops here? At the moment I am calling the GPU to run the inner loop and just using the CPU for the outer loop, but is it possible to use the GPU for both? With these figures I’m guessing my way is quicker, but if it were x < 1000 and z < 1000 that probably wouldn’t be the case.

Cheers,

Chris

I think it really depends on what processing you’re doing in the loops. I’ve just implemented some functions that have similar access patterns when done in pure C. I basically converted the entire loop structure to a single kernel with a halfway-educated memory access pattern, and it’s super fast. There wasn’t really any dependency at all on previous calculations on the data, though.

If the actions of one iteration of the loop depend a lot on previous iterations, it gets tougher to do that.

SrJsignal

If all iterations of the loops are independent, then you can use a 2D thread block to run this, no problem. If there are dependencies, as stated by SrJsignal, then things get trickier, but not impossible, depending on the nature of the dependencies.
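
For the independent case, here’s a minimal sketch of that 2D mapping. It assumes the 8 × 1000 iterations each update one element of a float array; the kernel name, the data layout, and the doubling are just placeholders for the real work:

__global__ void bothLoops(float *data, int nx, int nz)
{
    int x = blockIdx.y * blockDim.y + threadIdx.y;   /* outer-loop index */
    int z = blockIdx.x * blockDim.x + threadIdx.x;   /* inner-loop index */
    if (x < nx && z < nz)
        data[x * nz + z] *= 2.0f;                    /* stand-in for the real work */
}

/* in host code, with d_data already allocated via cudaMalloc: */
dim3 block(256, 1);
dim3 grid((1000 + block.x - 1) / block.x, 8);        /* one thread per (x, z) pair */
bothLoops<<<grid, block>>>(d_data, 8, 1000);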

Mark

I’ve pretty much solved this now.

I’ve managed to unroll both of the loops and remove the dependencies, giving me:

for (i = 0; i < (8 * 1000); i++)
{
    /* some work */
}
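
In CUDA terms that flattened loop maps straight onto a 1D grid. A minimal sketch, assuming one float per iteration, with the increment standing in for “some work”:

__global__ void flatLoop(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   /* stand-in for "some work" */
}

/* in host code, with d_data already on the device: */
int n = 8 * 1000;
int threads = 256;
int blocks = (n + threads - 1) / threads;
flatLoop<<<blocks, threads>>>(d_data, n);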

I’m now working with three loops of over 1,000,000 iterations each (instead of three sets of two nested loops of 1000), which lets me call three separate CUDA kernels with really high levels of parallelism.

I’m hoping for a huge performance advantage over the CPU, but I’m just waiting to run out of memory…
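
For what it’s worth, you can watch how close you are to the limit with the standard cudaMemGetInfo runtime call and check each allocation as you go. A quick sketch (n and d_data are just illustrative; needs <stdio.h> and the CUDA runtime headers):

size_t freeBytes, totalBytes;
cudaMemGetInfo(&freeBytes, &totalBytes);
printf("device memory: %zu MB free of %zu MB\n", freeBytes >> 20, totalBytes >> 20);

float *d_data;
if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed - probably out of memory\n");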

Chris