Thanks. So the problem I am trying to solve is I have an ordered array of decimals:
array = [0.1, 0.15, 0.2, …, 0.9, 0.99, 0.9999, 1]
which is typically 10,000 elements long.
Each thread will call curand_uniform to draw x between 0 and 1 (draw from curand_uniform, and needs to find the index i such that:
array[i] < x
x >= array[i+1].
My options seem to be either:
(1) To go through every row of the array from start to finish to avoid any branch divergence, e.g.
DECIMAL acc = 0.0;
int selectedRow = 0;
for (int j = 0; j < numRows; j++)
DECIMAL currProb = (*array)(j);
bool ind1 = rand < acc;
acc += (*array)(j);
bool ind2 = rand >= acc;
selectedRow = j * ind1 * ind2;
Or to do something smarter than will end up with branch divergence, e.g.
(2) Do a quick pass through the array in steps of 50 (say) and find a rough interval to do a more detailed search. If I am careful here perhaps I could control the number of iterations to be constant in each pass through which might avoid the divergence.
(3) Do a binary search like in https://stackoverflow.com/questions/21658518/search-an-ordered-array-in-a-cuda-kernel but again I believe this will generate divergent branches.
I will do some experiments to see what the timing impact of the different options are. Intuitively it feels like going through every row is likely to be slow given the size of the array.