Revised edition:
__global__ void cu_find_minima(double *A, int *minima)
{
    // find the minimum for each block
    const int i = threadIdx.x;
    const int j = blockIdx.x;
    const int t = j*blockDim.x + i;
    double my_value;
    __shared__ double min_value;

    // Read value from input array - note that in many scenarios, this would be replaced
    // by a computation of some kind or a serial min of several computations
    my_value = A[t]; // A[j][i]
    min_value = my_value; // racy initial write; any thread's value is a valid starting candidate
    __syncthreads(); // can be omitted in a single-warp scenario
    while (my_value < min_value)
    {
        // race condition: multiple writes to min_value, but we don't care!
        min_value = my_value;
        __syncthreads(); // can be omitted in a single-warp scenario, but with multiple warps this prevents a race to the conditional
    }
    if (my_value == min_value)
    {
        minima[j] = i; // save the index of the block's minimum
    }
}
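For completeness, a minimal host-side driver might look like the following. The grid dimensions, test data, and names in this sketch are my own illustrative assumptions, not part of the original code.

// Hypothetical host-side driver for cu_find_minima: one block of 32 threads
// (one warp) per group of 32 values. Sizes and names are illustrative only.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int blocks  = 4;                  // number of independent minima to find
    const int threads = 32;                 // one warp per block, as in the original problem
    const int n = blocks * threads;

    std::vector<double> h_A(n);
    for (int idx = 0; idx < n; ++idx)
        h_A[idx] = (double)((idx * 7919) % 1000);   // arbitrary test values

    double *d_A;
    int *d_minima;
    cudaMalloc(&d_A, n * sizeof(double));
    cudaMalloc(&d_minima, blocks * sizeof(int));
    cudaMemcpy(d_A, h_A.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cu_find_minima<<<blocks, threads>>>(d_A, d_minima);

    std::vector<int> h_minima(blocks);
    cudaMemcpy(h_minima.data(), d_minima, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    for (int j = 0; j < blocks; ++j)
        printf("block %d: minimum at thread %d\n", j, h_minima[j]);

    cudaFree(d_A);
    cudaFree(d_minima);
    return 0;
}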
Adding __syncthreads() generalizes my previous post to work with an arbitrary number of warps instead of a single warp.
If __syncthreads() is omitted, this only works when there is only one warp (which was the originally stated problem). This is because a race condition can cause min_value to be replaced by a larger value after the conditional is executed. As previously noted, the race condition on the assignment itself can be safely ignored.
I believe this will run faster than the other suggestions, particularly in the case where the values to be scanned are computed per thread in the kernel. It performs fewer than log2(n) parallel passes (assuming a sufficient number of cores), and each pass only performs one compare and one conditional write. After the first pass, most of the half-warps will unanimously skip the write operation. Normally we don’t consider a conditional to be beneficial, but when all threads in a half-warp agree, the conditional actually provides a significant savings. Also, there is no overhead for a temporary storage buffer, which matters if the minimum is being computed over values calculated per-thread by the kernel rather than over input data.
LSChien’s solution seems to come closest with 15 instructions + 2 __syncthreads(). The above code encounters an average of 4 __syncthreads() calls; however, in the single-warp case I omit the __syncthreads(), and the average case is 8 operations (4 compares and 4 writes). Admittedly, the worst case is 32 compares and 32 writes, but I’m more concerned with the average case. If I encountered a pathological scenario where the data tends to be sorted in an unfavorable manner, I would simply reverse the thread order and expect an even lower average case!
As njuffa pointed out, classical reduction is not optimal for this situation. Min is not the same as Add, so you should not have to work as hard to do it.
Another variation, per njuffa’s comments: each thread could serially compute (or read from an input buffer) several values, keeping track of its running minimum, and then proceed to the above code.
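A rough sketch of that variation follows; the kernel name, the VALUES_PER_THREAD constant, and the consecutive-element layout are my own illustrative assumptions.

// Sketch of the serial-accumulation variation: each thread privately reduces
// several consecutive values, then the block cooperates as in cu_find_minima.
// VALUES_PER_THREAD and the data layout are assumptions for illustration.
#define VALUES_PER_THREAD 8

__global__ void cu_find_minima_serial(double *A, int *minima)
{
    const int i = threadIdx.x;
    const int j = blockIdx.x;
    __shared__ double min_value;

    // Each block covers blockDim.x * VALUES_PER_THREAD consecutive input elements.
    const int base = (j * blockDim.x + i) * VALUES_PER_THREAD;

    // Serial pass: private running minimum over this thread's values.
    double my_value = A[base];
    for (int k = 1; k < VALUES_PER_THREAD; k++)
    {
        double v = A[base + k];
        if (v < my_value) my_value = v;
    }

    // Cooperative pass: same racy-write loop as in the original kernel.
    min_value = my_value;
    __syncthreads();
    while (my_value < min_value)
    {
        min_value = my_value;
        __syncthreads();
    }
    if (my_value == min_value)
    {
        minima[j] = i; // thread whose private minimum is the block minimum
    }
}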
Yet another variation: if all but one multiprocessor can be kept busy on other streams, we can dedicate one warp to the min operation, each thread serially computing 1 of every 32 values, and then apply the above solution with __syncthreads() omitted (because __syncthreads() is not necessary in the single-warp scenario).
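A sketch of that single-warp variation, again with my own names and assumptions: launch with <<<1, 32>>> so only a single warp runs, and note the volatile qualifier on the shared variable, which I added so the compiler does not cache it in a register once __syncthreads() is dropped.

// Sketch of the single-warp variation (kernel name and the n parameter are
// illustrative assumptions; assumes n >= 32). Each of the 32 threads strides
// through the data with a stride of 32, keeping a private running minimum,
// then the racy-write loop runs without __syncthreads() because only one
// warp is involved.
__global__ void cu_find_minimum_one_warp(double *A, int n, int *minimum_index)
{
    const int i = threadIdx.x;              // 0..31

    double my_value = A[i];
    int my_index = i;
    for (int k = i + 32; k < n; k += 32)
    {
        if (A[k] < my_value)
        {
            my_value = A[k];
            my_index = k;
        }
    }

    // volatile keeps every read of min_value going to shared memory rather than
    // a register, since there is no barrier here to force that
    volatile __shared__ double min_value;
    min_value = my_value;                   // racy write, as in the original kernel
    while (my_value < min_value)
    {
        min_value = my_value;
    }
    if (my_value == min_value)
    {
        *minimum_index = my_index;          // index of the overall minimum
    }
}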
There seems to be some weird magic going on here, since a basic understanding of reductions tells us not to expect better than log2(32) = 5 passes. Note that I am counting the initial write and the final compare in addition to 3 passes through the inner loop, for a total of 4 compares and 4 writes.
So how does this algorithm beat log2(n)? The answer is that we are gambling, but the odds are stacked in our favor. In each pass we are just as likely to have good luck as bad luck, but good luck yields more benefit than the bad luck costs. I haven’t worked out the mathematics (exercise left for the reader), but I get the following empirical result:
k = 2/3 * (1 + log2(n))
where n is the number of threads (with one value per thread) and k is the average number of parallel passes required.
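For anyone who wants to sanity-check that formula, here is a small host-side Monte Carlo sketch of my own. It models the winner of each racy write as uniformly random among the contending threads, which is an assumption about hardware arbitration rather than anything stated above, and it counts writes to min_value (the initial write plus the loop writes) as the pass count k.

// Host-side Monte Carlo sketch (my own, not from the original post) that
// simulates the racy-min loop on random data and compares the average number
// of writes to min_value against k = 2/3 * (1 + log2(n)).
#include <cstdio>
#include <cstdlib>
#include <cmath>

// Work in terms of ranks: rank 1 is the block minimum. The initial racy write
// installs an arbitrary thread's value (a uniformly random rank); each later
// pass installs the value of an arbitrary thread that is still smaller.
static int simulate_writes(int n)
{
    int rank = 1 + rand() % n;          // initial write: arbitrary thread wins
    int writes = 1;
    while (rank > 1)
    {
        rank = 1 + rand() % (rank - 1); // winner modeled as uniform among contenders
        writes++;
    }
    return writes;
}

int main()
{
    srand(12345);
    const int trials = 200000;
    for (int n = 32; n <= 1024; n *= 2)
    {
        double total = 0.0;
        for (int t = 0; t < trials; t++) total += simulate_writes(n);
        printf("n=%4d  measured k=%.3f  formula 2/3*(1+log2(n))=%.3f\n",
               n, total / trials, 2.0 / 3.0 * (1.0 + std::log2((double)n)));
    }
    return 0;
}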
Enjoy,
Ken