Branch divergence - for loop range

Hi,

If I have a loop like:

int maxValue = someArray[threadIdx];
for(int i = 0; i < maxValue; i++)
{
    // do something
}

which is executed on device across many threads, will this generate a divergent branch since the for loop will iterate a different number of times for each thread?

Thanks!

well yes (assuming the values in someArray are different).
But perhaps the first question is does such divergence matter?

Bill

yes, of course. threads are grouped into 32-wide warps, and any code that should executed by any thread in the warp, is [passively] executed by all threads in the warp

Thanks. So the problem I am trying to solve is I have an ordered array of decimals:

array = [0.1, 0.15, 0.2, …, 0.9, 0.99, 0.9999, 1]

which is typically 10,000 elements long.

Each thread will call curand_uniform to draw x between 0 and 1 (draw from curand_uniform, and needs to find the index i such that:

array[i] < x
x >= array[i+1].

My options seem to be either:

(1) To go through every row of the array from start to finish to avoid any branch divergence, e.g.

DECIMAL acc = 0.0;
			int selectedRow = 0;
			for (int j = 0; j < numRows; j++)
			{
				DECIMAL currProb = (*array)(j);
				bool ind1 = rand < acc;
				acc += (*array)(j);
				bool ind2 = rand >= acc;
				selectedRow = j * ind1 * ind2;
			}

Or to do something smarter than will end up with branch divergence, e.g.

(2) Do a quick pass through the array in steps of 50 (say) and find a rough interval to do a more detailed search. If I am careful here perhaps I could control the number of iterations to be constant in each pass through which might avoid the divergence.

(3) Do a binary search like in https://stackoverflow.com/questions/21658518/search-an-ordered-array-in-a-cuda-kernel but again I believe this will generate divergent branches.

I will do some experiments to see what the timing impact of the different options are. Intuitively it feels like going through every row is likely to be slow given the size of the array.