Actually, for this problem you can avoid the complexity of a parallel reduction. The overhead doesn't come from atomicMin itself, it comes from calling it so many times.
So have each thread keep its own running minimum, and then at the end, just once, do the atomicMin. The overhead is negligible then and your code complexity will drop enormously.
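    int minVal = 0x7FFFFFFF;  // per-thread minimum, starts at the largest int
    for (int i = threadIdx.x; i < maxN; i += blockDim.x)
        minVal = min(minVal, a[i]);  // each thread scans its own strided slice
    atomicMin(&s_index, minVal);  // one atomic per thread; s_index must already hold a large initial value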
With the tweak above, you'll issue only one atomicMin per thread (so perhaps 256 in total, which is negligible). With the atomicMin inside the loop, you'd issue maxN atomicMins, which could be huge if your array is big.
You won’t see any performance difference between the above trivial code and the parallel reduction unless maxN is smaller than a few thousand.
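For completeness, here is a rough sketch of how the snippet above might sit inside a full kernel. The kernel name blockMinKernel, the host helper blockMin, the result pointer, and the single-block launch are my own assumptions for illustration, not something from your code. It also assumes s_index is a __shared__ int, and it leaves out error checking for brevity.

```cuda
#include <climits>
#include <cuda_runtime.h>

// Hypothetical kernel: one block scans a[0..maxN) and writes the minimum to *result.
__global__ void blockMinKernel(const int* a, int maxN, int* result)
{
    __shared__ int s_index;                  // block-wide minimum (assumed to be shared memory)
    if (threadIdx.x == 0) s_index = INT_MAX; // initialize before anyone calls atomicMin
    __syncthreads();

    int minVal = INT_MAX;                    // per-thread minimum
    for (int i = threadIdx.x; i < maxN; i += blockDim.x)
        minVal = min(minVal, a[i]);

    atomicMin(&s_index, minVal);             // exactly one atomic per thread
    __syncthreads();                         // wait until every thread has contributed

    if (threadIdx.x == 0) *result = s_index;
}

// Hypothetical host helper: copies n ints to the device, runs one block, returns the minimum.
// Assumes n > 0; CUDA error checking is omitted to keep the sketch short.
int blockMin(const int* hostData, int n, int threads = 256)
{
    int* dData = nullptr;
    int* dResult = nullptr;
    int out = INT_MAX;

    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dResult, sizeof(int));
    cudaMemcpy(dData, hostData, n * sizeof(int), cudaMemcpyHostToDevice);

    blockMinKernel<<<1, threads>>>(dData, n, dResult);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&out, dResult, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dResult);
    return out;
}
```

The two __syncthreads() calls are the only extra bookkeeping: the first makes sure s_index is initialized before any thread touches it, the second makes sure all the atomics have landed before thread 0 reads the result.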
Actually, maxN is only 256 in my case, but this function is called many times, so I'd like to find an efficient way to implement it. Since maxN is small here, which one would be faster: the reduction or your code?