Thread-local 'for loop' question: parallel for loop execution


When I have the following kernel:

int tid = threadIdx.x;

for (int i = a[tid]; i < b[tid]; i++)
    c[tid] = ...

Will the part that could be executed in parallel really run in parallel? Or is it a problem that the for loop does not have the same size in every thread?


It will execute in parallel as long as there is no divergence of execution, that is, as long as the same instructions can be executed for all threads within a warp. When the loop finishes for one tid but is incomplete for another, you'll have divergence. In that case, the finished tid will sit idle until all the other threads in the warp have completed.

So the answer to your question depends on the distribution of the number of loop iterations across the threads. If most threads within the same warp loop, say, 3 times, and 1 thread loops 100 times, you have a problem (= it will be slow). But if some threads loop 31 times and others loop 33 times, then it's probably fine.
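To make this concrete, here is a minimal sketch of the situation in the question (the kernel name and the loop body are hypothetical; `a`, `b`, `c` are the arrays from the original code):

```cuda
// Hypothetical illustration of per-thread loop bounds.
// Threads in a warp that finish their loop early are masked off and
// wait until the longest-running thread in that warp is done, so the
// warp's cost is max(b[tid] - a[tid]) over its 32 threads.
__global__ void variableLoop(const int *a, const int *b, float *c)
{
    int tid = threadIdx.x;

    for (int i = a[tid]; i < b[tid]; i++) {
        c[tid] += 1.0f;  // placeholder for the real loop body
    }
}
```

If the trip counts within a warp are similar (e.g. 31 vs. 33), the wasted idle time is small; one outlier (3 vs. 100) stalls the entire warp for the duration of the outlier.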


TomV is completely correct, I just want to add my 2 cents. I have a loop very similar to yours in my code, though the loop body is pretty large and does a lot of computations. On average, each loop does about 100 iterations, though it varies from 40 to 120 somewhat randomly.

I’ve tried several ways of preventing the divergence, but ended up falling back on the simplest code that has the divergence. The hardware seems very efficient at handling it: I notice only the tiniest performance difference when the code is run on a real dataset compared to a dataset where every loop runs for exactly 100 iterations and there is no divergence.

good, that’s exactly the answer I was hoping for :)

What if I had 2 nested loops of the same kind (as described above)? Is CUDA still that smart, or would I need to synchronize after the execution of the inner loop?


You only need to synchronize threads when there are possible race conditions on shared memory. The hardware handles all divergent warps without any user intervention, nested loops included. Note that if you do use __syncthreads(), it must be reached by every thread in the block, so don't place it inside divergent code such as a loop with per-thread bounds.
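A small sketch of the distinction, assuming a hypothetical kernel where each thread reads a neighbour's shared-memory slot (the kernel name and the `tid ^ 1` access pattern are just for illustration):

```cuda
// Divergence needs no user action, but reading shared memory written
// by *other* threads does need a barrier.
__global__ void sharedExample(const float *in, float *out)
{
    __shared__ float buf[256];
    int tid = threadIdx.x;

    buf[tid] = in[tid];   // each thread writes only its own slot

    __syncthreads();      // required: without this barrier, the
                          // neighbour's slot read below might not
                          // have been written yet

    out[tid] = buf[tid] + buf[tid ^ 1];  // read a neighbour's slot
}
```

If each thread only ever touched its own data (as in the original question's loop writing `c[tid]`), no barrier would be needed at all.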