Reduction question, searching for the lane index of the thread with the highest value


An interesting reduction question.

We’re storing 32 float values distributed across threads of each warp.
Now per warp we want the lane index of the thread with the highest value.

Is there a particularly efficient way of getting this result?


I haven’t thought deeply about it, but couldn’t you just use the basic __shfl()-based maximum-per-warp search (, but instead of max() use an if-then-else that propagates both the value and an accompanying index?

More efficient would likely be this: stuff the value into the high bits of an integer, the lane index into the lowest five bits, then use the regular shuffle-based maximum search. After the maximum has been computed, extract maximum value and index. In case of multiple lanes holding the maximum value, this would return the index of the highest lane holding that value. If you want the lowest lane index instead, store 32-lane_idx into the low-order five bits.

This assumes you can somehow easily compress your ‘float’ data into 27 bits, which may or may not be possible. If the range of the data is limited, the easiest way would probably be to remove the sign and some exponent bits to achieve the compression.

I would implement it this way:

max_f = warp_butterfly_reduction_max_f32(f)
mask  = __ballot(f == max_f)
lane  = bfind.u32(mask)

That’s 13 ops: shfl/max x 5 + setp + vote.ballot + bfind.u32.

This will return the highest lane index in case there are multiple matches.

You might want to convince yourself that your max.f32 (or max.ftz.f32) and setp.eq.f32 (or setp.eq.ftz.f32) are going to work with your input data. @njuffa can probably describe a situation where the reduction wouldn’t match a lane value?

I was thinking about NaN’s and/or accidentally mixing in .ftz’s. I think there is nothing to worry about if you don’t do anything dumb in PTX. :)

Thank you allanmac, I was hoping that warp voting would be useful in this context - you just confirmed it.

I found some macro that I can use in CUDA code to access the bfind PTX instruction

// __bind(unsigned int i): Find the most significant bit in a 32/64 number (PTX).
__device__ __forceinline__ int __bfind(unsigned int i) const { int b; asm volatile("bfind.u32 %0, %1;" : "=r"(b) : "r"(i)); return b; }

I’ve already got some an implementation of a butterfly reduction.

This should be a breeze. Thanks.

Could this be faster?

max_f = warp_butterfly_reduction_max_f32(f);
if (f == max_f) shared[warpID]=myLane; 
// no syncthreads(), uses warp synchronous architecture assumption
lane  = shared[warpID];

The write in line 2 may have multiple threads writing to the same location, but ONE is guaranteed to succeed so the net behavior is correct.

I don’t know how ballot() performance compares to a predicated smem store + broadcast load.

I’ve used both approaches in my code though!

Trivia: pre-Maxwell the highest lane would “win” the smem store. That appears to no longer be the case.

One other point for @cbuchner1 – I implied you needed PTX but, as I’m sure you’ve determined, you can implement your reduction entirely with intrinsics.

I believe the vote instruction is the fastest returning instruction on the device. On Maxwell it only requires 2 clocks to execute.

I recently had to write some similar code. I had each thread in the warp calculating a lookup table entry in shared memory, but some of the threads could generate empty entries and I wanted the lookup table holes filled in by shifting things left as needed. So I used a vote/popc combination with some bit mask logic.

I almost went with a complicated warp shuffle approach but then realized vote was all I needed. Handy little instruction.