Does CUDA support variable loop limits?

Hi,

I am about to start optimizing a piece of code with CUDA. I am new to it, so I am stuck on a simple question at the moment. In nearly all the CUDA examples I have seen, the limits of the for loops are passed to the kernel from the host code, so the number of iterations of a for loop in a kernel is known before the kernel is launched. What if the iteration count can only be determined while the kernel is running?

For example:

…Kernel Code…
int i;
int loopLimit = Some_Value_Calculated_Inside_of_the_Kernel;

for (i = 0; i < loopLimit; i++)
{
    // Do Something
}
…Kernel Code…

Since loopLimit can take a different value in each thread of a block, the threads would diverge, because each one would execute a different number of loop iterations. How does CUDA handle that, or does it even support it?

Thanks in advance.

It should be OK if the loopLimit variable is the same for all threads in a warp. Otherwise you’ll get warp divergence.

Yes, that’s fine. Warp serialization due to divergent loopLimit values might hamper performance, but it’s not a showstopper.
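
To make this concrete, here is a minimal sketch of a kernel where every thread works out its own loop limit at run time. The names variableLoopKernel and runVariableLoops, and the "input[tid] % 16" rule for the limit, are made up for illustration; they only stand in for Some_Value_Calculated_Inside_of_the_Kernel. Error checking on the CUDA calls is left out to keep it short. The kernel is correct for any mix of limits; within a warp, threads that finish early simply wait until the thread with the largest loopLimit is done.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread derives its own loop limit inside the kernel and then
// iterates that many times. Neighbouring threads may loop a different
// number of times; the hardware handles the divergence.
__global__ void variableLoopKernel(const int *input, int *output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Hypothetical stand-in for Some_Value_Calculated_Inside_of_the_Kernel.
    int loopLimit = input[tid] % 16;

    int sum = 0;
    for (int i = 0; i < loopLimit; i++) {
        sum += i;               // "Do Something"
    }
    output[tid] = sum;
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel with one thread per element, and copies the results back.
std::vector<int> runVariableLoops(const std::vector<int> &input)
{
    int n = static_cast<int>(input.size());
    std::vector<int> output(n, 0);
    if (n == 0) return output;

    int *dIn = nullptr;
    int *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, input.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    variableLoopKernel<<<blocks, threads>>>(dIn, dOut, n);

    // This copy runs after the kernel on the default stream, so no
    // separate synchronization call is needed here.
    cudaMemcpy(output.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return output;
}
```

Calling runVariableLoops from your own main() with a handful of different input values should show each thread returning the sum for its own loop limit. If performance matters, one option is to arrange the data so that threads in the same warp get similar limits, since a warp takes roughly as long as its slowest thread.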