I am about to start optimizing a piece of code using CUDA. I am a newbie to it, so I am stuck on a simple question at the moment. In nearly all the CUDA examples I have seen, the limits of the for loops are passed to the kernel from the host code, so the number of iterations of a loop inside the kernel is known before the kernel is launched. What if the iteration count can only be determined at runtime, inside the kernel itself?
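Something like this minimal sketch is what I mean (the kernel name, the data layout, and how `loopLimit` is obtained are just illustrative assumptions; the point is that the bound is computed per thread at runtime):

```cuda
// Illustrative sketch: each thread reads its own loop bound at runtime,
// so different threads in the same warp may iterate a different number
// of times.
__global__ void variableLoopKernel(const int *data, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // loopLimit is only known here, inside the kernel,
    // and can differ from thread to thread.
    int loopLimit = data[tid];

    float sum = 0.0f;
    for (int i = 0; i < loopLimit; ++i) {
        sum += i * 0.5f;   // placeholder per-iteration work
    }
    out[tid] = sum;
}
```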
Since the variable loopLimit can take a different value in each thread of a block, the threads would diverge because they would iterate the for loop a different number of times. How does CUDA handle that, or does it even support it?
Thanks in advance.