thread local 'for loop' question: thread parallel for loop execution

ashcor · August 6, 2007, 10:31pm

Hi

When I have the following kernel:

int tid = threadIdx.x;

for (int i = a[tid]; i < b[tid]; i++)

    c[tid] = ...

Will the part that could be executed in parallel really run in parallel? Or is it a problem that the for loop does not have the same size in every thread?

Severin

TomV · August 7, 2007, 7:42am

It will execute in parallel as long as there is no divergence of execution, that is, as long as the same instructions can be executed for all threads within a warp. When the loop finishes for one tid but is incomplete for another in the same warp, you’ll have a divergence. In that case, the finished thread will sit idle until all other threads in its warp have completed the loop.

So the answer to your question depends on the distribution of loop iteration counts across the threads of a warp. If most threads within the same warp loop, say, 3 times, and 1 thread loops 100 times, you have a problem (= it will be slow). But if some threads loop 31 times and others loop 33 times, then it’s probably ok.

Tom

MisterAnderson42 · August 7, 2007, 1:35pm

TomV is completely correct, I just want to add my 2 cents. I have a loop very similar to yours in my code, though the loop body is pretty large and does a lot of computations. On average, each loop does about 100 iterations, though it varies from 40 to 120 somewhat randomly.

I’ve tried several ways of preventing the divergence, but ended up falling back on the simplest code that has the divergence. The hardware seems very efficient at handling the divergence, and I notice only the tiniest performance difference when the code is run on a real dataset compared to running on a dataset where every loop runs for exactly 100 iterations and there is no divergence.

ashcor · August 7, 2007, 3:22pm

good, that’s exactly the answer I was hoping for :)
thanks

ashcor · August 29, 2007, 6:42pm

What if I had 2 nested loops of the same kind (as described above)? Is CUDA still that smart? Or would I need to synchronize after the execution of the inner loop?

regards

MisterAnderson42 · August 29, 2007, 8:34pm

You only need to synchronize threads (with __syncthreads()) when there are possible race conditions on shared memory. The hardware handles all divergent warps without any user intervention.
