Branch divergence - for loop range

John_Smith_Lon · February 11, 2018, 1:19pm

Hi,

If I have a loop like:

int maxValue = someArray[threadIdx.x];
for(int i = 0; i < maxValue; i++)
{
    // do something
}

which is executed on device across many threads, will this generate a divergent branch since the for loop will iterate a different number of times for each thread?

Thanks!

wlangdon · February 11, 2018, 3:22pm

Well, yes (assuming the values in someArray are different).
But perhaps the first question is: does such divergence matter?

Bill

BulatZiganshin · February 11, 2018, 9:57pm

Yes, of course. Threads are grouped into 32-wide warps, and any code that should be executed by any thread in the warp is [passively] executed by all threads in the warp.
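
To put a rough number on that, here is a small host-side sketch of the cost model described above. It assumes a warp keeps stepping through the loop until its longest-running thread is done, with the finished threads simply masked off. The names WarpCost and estimateWarpCost are made up for illustration, and real hardware scheduling is more involved than this.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cost summary for a per-thread loop like the one in the question.
struct WarpCost {
    long long issued;  // iteration slots the warps step through (all lanes counted)
    long long useful;  // iterations the threads actually needed
};

// Simplified model (an assumption, not a hardware guarantee): each warp runs
// the loop as many times as its largest per-thread trip count.
WarpCost estimateWarpCost(const std::vector<int>& tripCounts, int warpSize = 32) {
    WarpCost cost{0, 0};
    const std::size_t width = static_cast<std::size_t>(warpSize);
    for (std::size_t base = 0; base < tripCounts.size(); base += width) {
        const std::size_t end = std::min(tripCounts.size(), base + width);
        int longest = 0;
        for (std::size_t t = base; t < end; ++t) {
            longest = std::max(longest, tripCounts[t]);
            cost.useful += tripCounts[t];
        }
        cost.issued += static_cast<long long>(longest) * warpSize;
    }
    return cost;
}
```

If the values in someArray are similar within a warp, issued and useful stay close and the divergence costs little. One large outlier per warp is the bad case.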

John_Smith_Lon · February 12, 2018, 12:33am

Thanks. So the problem I am trying to solve is I have an ordered array of decimals:

array = [0.1, 0.15, 0.2, …, 0.9, 0.99, 0.9999, 1]

which is typically 10,000 elements long.

Each thread will call curand_uniform to draw x between 0 and 1, and needs to find the index i such that:

array[i] < x
x <= array[i+1].

My options seem to be either:

(1) To go through every row of the array from start to finish to avoid any branch divergence, e.g.

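// Here array holds per-row probabilities that are accumulated into acc.
DECIMAL acc = 0.0;
int selectedRow = 0;
for (int j = 0; j < numRows; j++)
{
    DECIMAL currProb = (*array)(j);
    bool ind1 = rand > acc;          // draw lies above the lower edge of row j
    acc += currProb;
    bool ind2 = rand <= acc;         // draw lies at or below the upper edge of row j
    selectedRow += j * ind1 * ind2;  // only the matching row contributes
}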

Or to do something smarter that will end up with branch divergence, e.g.

(2) Do a quick pass through the array in steps of 50 (say) and find a rough interval to do a more detailed search. If I am careful here perhaps I could control the number of iterations to be constant in each pass through which might avoid the divergence.

(3) Do a binary search as in "Search an ordered array in a CUDA kernel" (Stack Overflow), but again I believe this will generate divergent branches.
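
For what it's worth, option (3) does not have to diverge in the loop itself. Below is a minimal sketch of a binary search whose trip count depends only on the array length, so every thread runs the same number of steps and only the memory addresses differ. The names HD and findInterval are hypothetical, float stands in for DECIMAL, and the array is assumed to be sorted ascending. The HD macro is just there so the same function builds as plain C++ or under nvcc.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Returns i such that sorted[i] < x <= sorted[i + 1], or -1 if x <= sorted[0].
// If x is above the last element, the last index (n - 1) is returned.
HD inline int findInterval(const float* sorted, int n, float x) {
    // Largest power of two not exceeding n; this depends on n only.
    int step = 1;
    while ((step << 1) <= n) step <<= 1;

    int pos = 0;  // number of elements known to be < x
    for (; step > 0; step >>= 1) {
        const int probe = pos + step;
        // The && short-circuit keeps the read in bounds; the update is a
        // select, which the compiler can predicate instead of branching.
        const bool take = (probe <= n) && (sorted[probe - 1] < x);
        pos = take ? probe : pos;
    }
    return pos - 1;
}
```

For a 10,000-element array this is 14 steps per thread, the same for every thread. Whether the scattered reads hurt more than divergence would is something only timing can answer.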

I will do some experiments to see what the timing impact of the different options is. Intuitively it feels like going through every row is likely to be slow given the size of the array.
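
As a starting point for those experiments, option (2) could be sketched roughly as follows: a coarse pass over every stride-th element, then a fine pass over one block of stride elements, with both loops running a fixed number of times. The names HD and findIntervalTwoPass are hypothetical, float stands in for DECIMAL, and the array is assumed to be sorted ascending.

```cpp
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Returns i such that sorted[i] < x <= sorted[i + 1], or -1 if x <= sorted[0].
HD inline int findIntervalTwoPass(const float* sorted, int n, float x,
                                  int stride = 50) {
    // Pass 1: find the largest multiple of stride whose elements are all < x.
    // The trip count depends on n and stride only, not on x.
    int blockStart = 0;
    for (int b = stride; b <= n; b += stride) {
        blockStart = (sorted[b - 1] < x) ? b : blockStart;
    }

    // Pass 2: count the elements < x inside that one block, always running
    // stride iterations and guarding reads past the end of the array.
    int count = blockStart;
    for (int k = 0; k < stride; ++k) {
        const int idx = blockStart + k;
        const bool below = (idx < n) && (sorted[idx] < x);
        count += below ? 1 : 0;
    }
    return count - 1;
}
```

With 10,000 elements and a stride of 50 this reads 200 + 50 = 250 elements per thread instead of 10,000. A stride near the square root of the length (100 here) balances the two passes at 100 + 100.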
