It might even be worth implementing cases 1 and 2 as reductions to avoid worst-case bank conflicts: use a shared array of 16 values and let each thread of a half-warp write to a different bank. After all threads have finished writing, use a lightweight reduction without __syncthreads() to determine the final result (see the sketch below).
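A minimal sketch of that idea, assuming a 16-bank device where a half-warp of 16 threads runs in lockstep and assuming only one half-warp of the block uses it; the name halfwarp_min() and the incoming value are made up for illustration:

__device__ int halfwarp_min(int value)
{
    // one slot per bank, so the 16 writes below are conflict-free
    __shared__ volatile int slot[16];

    int lane = threadIdx.x & 15;              // index within the half-warp
    slot[lane] = value;                       // each thread hits a different bank

    // lightweight tree reduction; no __syncthreads() because the half-warp
    // executes in lockstep (on newer hardware you would add __syncwarp())
    if (lane < 8) slot[lane] = min(slot[lane], slot[lane + 8]);
    if (lane < 4) slot[lane] = min(slot[lane], slot[lane + 4]);
    if (lane < 2) slot[lane] = min(slot[lane], slot[lane + 2]);
    if (lane < 1) slot[lane] = min(slot[lane], slot[lane + 1]);

    return slot[0];                           // minimum of the 16 values
}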
Yes, I would also do all of them as reductions (if you need to do it on the same data, that might even be faster).
I have, for example, made a kernel that calculates the mean, min, max and standard deviation of a couple of arrays of values. It is all basically a reduction and very fast.
__shared__ volatile int minima;

if (threadIdx.x == 0)
    minima = INT_MAX;                  // int sentinel (INT_MAX from <climits>); INFINITY is a float constant
__syncthreads();                       // make the initial value visible to all threads before the loop

int local_value = expression(threadIdx.x);   // expression() stands for whatever produces this thread's value

// N is the number of values (thread groups) you have to compare.
for (int i = 0; i < N; i++)
{
    if (local_value < minima)
        minima = local_value;          // racy write; repeating the loop lets the true minimum win
    else
        break;                         // another thread already holds a smaller value
}
__syncthreads();
At the end of the code “minima” holds what you want. But I don't think this will be efficient if your “thread groups” are as big as 512 (i.e. one value to compare per thread).
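For groups that large, a standard shared-memory tree reduction is usually the better choice. A minimal sketch for a 512-thread block, assuming blockDim.x is a power of two and reusing the hypothetical expression() from above:

__shared__ int values[512];            // one slot per thread of the block

values[threadIdx.x] = expression(threadIdx.x);
__syncthreads();

// halve the number of active threads each step: 512 -> 256 -> ... -> 1
for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
{
    if (threadIdx.x < stride)
        values[threadIdx.x] = min(values[threadIdx.x], values[threadIdx.x + stride]);
    __syncthreads();
}

// values[0] now holds the minimum over all 512 thread values; the same loop
// with +, max or sum of squares gives the other statistics mentioned earlier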