Parallelizing for loops using CUDA


I have a for loop that takes around 16 ms to execute, and it runs conditionally inside another for loop, about 500 times in total.

Serial code format is like this:

// outer for loop
for (i = 0; i < 500; i++) {
    // some conditions
    // some function calls
    // some nested function calls

    // inner for loop (~16 ms)
    for (j = 0; some condition; j++) {
        // ...
    }
}

I want to parallelize the inner for loop.
Is it possible, using CUDA, to reduce the time required to execute the inner for loop by 40%, and hence the total time required to run the serial code?

Please help.



You need to give more details about the inner loop. First, what is “some condition”? Second, is the instruction at iteration j independent of the instruction at iteration j’? Third, what happens before and after the inner loop? How often will data need to be copied from the CPU to the GPU and back?

Yup, I’ll find that out. But my basic question is: for parallelizing a loop that takes ~16 ms to execute, and thereby reducing the overall time required to execute the outer for loop, is CUDA a good solution?

The answer is a clear “maybe”.

It all depends on what is happening in the inner loop.
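If the inner iterations turn out to be independent, the usual mapping is one CUDA thread per value of j. A minimal sketch, assuming the inner loop just computes one output element per j from one input element (the names and the per-element work are placeholders, not your actual code):

```cuda
// Each thread handles one value of j from the former inner loop.
__global__ void innerLoopKernel(const float *in, float *out, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n)                     // guard: the grid may be larger than n
        out[j] = 2.0f * in[j];     // placeholder for the real per-j work
}

// Host side, called once per outer-loop iteration:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   innerLoopKernel<<<blocks, threads>>>(d_in, d_out, n);
```

One caveat relevant to the 40% goal: keep `d_in`/`d_out` resident on the GPU across all 500 outer iterations if you can. A `cudaMemcpy` in each direction on every iteration can easily eat more than the 16 ms you are trying to save.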