Hi,
I am about to start optimizing a piece of code with CUDA. I am a newbie to it, so I am stuck on a simple question at the moment. In nearly all the CUDA examples I have seen, the limits of the for loops are passed to the kernel from the host code, so the number of iterations a for loop inside the kernel will run is known before the kernel is launched. What if the iteration count can only be determined at runtime, inside the kernel itself?
For example:
// …kernel code…
int loopLimit = Some_Value_Calculated_Inside_of_the_Kernel;
for (int i = 0; i < loopLimit; i++)
{
    // do something
}
// …kernel code…
Since the variable loopLimit can take a different value in every thread of a block, the threads would diverge, because they would iterate the for loop different numbers of times. How does CUDA handle that, or does it even support it?
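To make the situation concrete, here is a minimal, self-contained sketch of the kind of kernel I have in mind. The kernel name, the limits array, and the per-iteration work are all placeholders I made up for illustration; the point is only that each thread's loop bound is read at runtime and can differ per thread:

```cuda
__global__ void variableLoopKernel(const int *limits, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // loop bound is only known at runtime, and differs per thread
    int loopLimit = limits[tid];

    int sum = 0;
    for (int i = 0; i < loopLimit; i++)
    {
        sum += i; // stand-in for the real per-iteration work
    }
    out[tid] = sum;
}
```

Here threads in the same warp whose loopLimit values differ would execute different numbers of iterations, which is exactly the divergence I am asking about.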
Thanks in advance.