It will execute in parallel as long as there is no divergence of execution, that is, as long as the same instructions can be executed for all threads within a warp. When the loop finishes for one tid but is still running for another, you'll have divergence. In that case, the finished tid will sit idle until all other threads in the warp have completed the loop.
So the answer to your question depends on the distribution of the number of loops for each thread. If most threads within the same warp loop, say, 3 times, and 1 thread loops 100 times, you have a problem (= it will be slow). But if some threads are looping 31 times and some others loop 33 times, then it’s probably ok.
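To make the argument concrete, here is a minimal sketch (in Python, as a back-of-the-envelope cost model, not a measurement of real hardware) of the lockstep behavior described above: a warp's loop cost is governed by its slowest thread, so skewed iteration counts waste lanes while nearly uniform counts do not. The `warp_loop_cost` and `efficiency` helpers are hypothetical names, assumed for illustration.

```python
# Simplified SIMT cost model: a warp advances in lockstep, so a divergent
# loop costs as many iterations as its slowest thread; finished threads
# idle for the remainder.

WARP_SIZE = 32

def warp_loop_cost(iterations_per_thread):
    """Loop-body executions the warp pays for: the max over its threads."""
    assert len(iterations_per_thread) <= WARP_SIZE
    return max(iterations_per_thread)

def efficiency(iterations_per_thread):
    """Fraction of executed lane-iterations that do useful work."""
    useful = sum(iterations_per_thread)
    paid = warp_loop_cost(iterations_per_thread) * len(iterations_per_thread)
    return useful / paid

# 31 threads loop 3 times, 1 thread loops 100 times: badly divergent.
skewed = [3] * 31 + [100]
# Threads loop 31 or 33 times: mild divergence.
balanced = [31] * 16 + [33] * 16

print(f"skewed:   cost={warp_loop_cost(skewed)}, "
      f"efficiency={efficiency(skewed):.2f}")    # cost=100, efficiency=0.06
print(f"balanced: cost={warp_loop_cost(balanced)}, "
      f"efficiency={efficiency(balanced):.2f}")  # cost=33, efficiency=0.97
```

Under this model the 3-vs-100 warp does useful work on only about 6% of its executed lanes, while the 31-vs-33 warp stays above 95%, which matches the intuition in the paragraph above.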
TomV is completely correct, I just want to add my 2 cents. I have a loop very similar to yours in my code, though the loop body is pretty large and does a lot of computations. On average, each loop does about 100 iterations, though it varies from 40 to 120 somewhat randomly.
I’ve tried several ways of preventing the divergence, but ended up falling back on the simplest code that has the divergence. The hardware seems very efficient at handling the divergence, and I notice only the tiniest performance difference when the code is run on a real dataset compared to running on a dataset where every loop runs for exactly 100 iterations and there is no divergence.