Segmented Reduction of small subarrays

I wasn’t very clear when I said you don’t really need temp_val. What I meant was you could do something like:

    for (int offset = 16; offset > 0; offset >>= 1)
        val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset));

But it's not necessarily “better” than what you have.
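
In case it helps, here is a minimal sketch of that loop dropped into a complete kernel. It assumes one full warp of 32 threads per block and int data, which may not match your setup. The names warp_min_kernel and warp_min are ones I made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <climits>

// Assumption: each block is exactly one warp (32 threads) and reduces
// 32 consecutive ints from `in` to a single minimum in `out`.
__global__ void warp_min_kernel(const int *in, int *out)
{
    int val = in[blockIdx.x * 32 + threadIdx.x];

    // The shuffle result feeds straight into min(), so no temp_val is needed.
    // Lanes whose source lane would be past the end of the warp get their
    // own value back, so min() leaves them unchanged.
    for (int offset = 16; offset > 0; offset >>= 1)
        val = min(val, __shfl_down_sync(0xFFFFFFFF, val, offset));

    // Only lane 0 is guaranteed to hold the minimum of all 32 values.
    if (threadIdx.x == 0)
        out[blockIdx.x] = val;
}

// Hypothetical host helper: reduces 32 ints on the GPU and returns the
// minimum, or INT_MAX if a CUDA call fails.
int warp_min(const int *host_in)
{
    int *d_in = nullptr;
    int *d_out = nullptr;
    int result = INT_MAX;

    if (cudaMalloc(&d_in, 32 * sizeof(int)) != cudaSuccess)
        return INT_MAX;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return INT_MAX;
    }

    cudaMemcpy(d_in, host_in, 32 * sizeof(int), cudaMemcpyHostToDevice);
    warp_min_kernel<<<1, 32>>>(d_in, d_out);

    // This blocking copy also waits for the kernel to finish.
    if (cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        result = INT_MAX;

    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling warp_min on a 32-element host array should give the same answer as a plain CPU loop over that array. Note that only lane 0 ends up with the full result; the other lanes hold partial minima.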