Reduction question, searching for the lane index of the thread with the highest value

cbuchner1 · August 18, 2015, 4:06pm

Hi,

An interesting reduction question.

We’re storing 32 float values distributed across threads of each warp.
Now per warp we want the lane index of the thread with the highest value.

Is there a particularly efficient way of getting this result?

Christian

njuffa · August 18, 2015, 7:34pm

I haven’t thought deeply about it, but couldn’t you just use the basic __shfl()-based maximum-per-warp search ([url]https://devtalk.nvidia.com/default/topic/534848/cuda-programming-and-performance/shfl-function-in-kepler/post/4030883/#4030883[/url]), but instead of max() use an if-then-else that propagates both the value and an accompanying index?

More efficient would likely be this: stuff the value into the high bits of an integer, the lane index into the lowest five bits, then use the regular shuffle-based maximum search. After the maximum has been computed, extract maximum value and index. In case of multiple lanes holding the maximum value, this would return the index of the highest lane holding that value. If you want the lowest lane index instead, store 32-lane_idx into the low-order five bits.

This assumes you can somehow easily compress your ‘float’ data into 27 bits, which may or may not be possible. If the range of the data is limited, the easiest way would probably be to remove the sign and some exponent bits to achieve the compression.

allanmac · August 18, 2015, 8:18pm

I would implement it this way:

max_f = warp_butterfly_reduction_max_f32(f)
mask  = __ballot(f == max_f)
lane  = bfind.u32(mask)

That’s 13 ops: shfl/max x 5 + setp + vote.ballot + bfind.u32.

This will return the highest lane index in case there are multiple matches.

You might want to convince yourself that your max.f32 (or max.ftz.f32) and setp.eq.f32 (or setp.eq.ftz.f32) are going to work with your input data. @njuffa can probably describe a situation where the reduction wouldn’t match a lane value?

I was thinking about NaN’s and/or accidentally mixing in .ftz’s. I think there is nothing to worry about if you don’t do anything dumb in PTX. :)

cbuchner1 · August 19, 2015, 5:15am

Thank you allanmac, I was hoping that warp voting would be useful in this context - you just confirmed it.

cbuchner1 · August 19, 2015, 8:47am

I found some macro that I can use in CUDA code to access the bfind PTX instruction

// __bind(unsigned int i): Find the most significant bit in a 32/64 number (PTX).
__device__ __forceinline__ int __bfind(unsigned int i) const { int b; asm volatile("bfind.u32 %0, %1;" : "=r"(b) : "r"(i)); return b; }

I’ve already got some an implementation of a butterfly reduction.

This should be a breeze. Thanks.

SPWorley · August 19, 2015, 4:32pm

Could this be faster?

max_f = warp_butterfly_reduction_max_f32(f);
if (f == max_f) shared[warpID]=myLane; 
// no syncthreads(), uses warp synchronous architecture assumption
lane  = shared[warpID];

The write in line 2 may have multiple threads writing to the same location, but ONE is guaranteed to succeed so the net behavior is correct.

allanmac · August 19, 2015, 5:07pm

I don’t know how ballot() performance compares to a predicated smem store + broadcast load.

I’ve used both approaches in my code though!

Trivia: pre-Maxwell the highest lane would “win” the smem store. That appears to no longer be the case.

One other point for @cbuchner1 – I implied you needed PTX but, as I’m sure you’ve determined, you can implement your reduction entirely with intrinsics.

scottgray · August 20, 2015, 4:18am

I believe the vote instruction is the fastest returning instruction on the device. On Maxwell it only requires 2 clocks to execute.

I recently had to write some similar code. I had each thread in the warp calculating a lookup table entry in shared memory, but some of the threads could generate empty entries and I wanted the lookup table holes filled in by shifting things left as needed. So I used a vote/popc combination with some bit mask logic.

I almost went with a complicated warp shuffle approach but then realized vote was all I needed. Handy little instruction.

Topic		Replies	Views
Determine max across a warp; CUDA Programming and Performance	0	986	August 22, 2014
shfl function in kepler CUDA Programming and Performance	11	6819	November 26, 2013
reduction in registers CUDA Programming and Performance	2	841	March 20, 2013
find maximum value in an array along with index CUDA Programming and Performance	40	19170	October 11, 2010
How to realize that CUDA Programming and Performance	6	1962	February 16, 2010
different output every time I run my code probably wrong in finding max value. CUDA Programming and Performance	4	3122	June 21, 2011
Find the largest value among all threads result In CPU or GPU? CUDA Programming and Performance	8	8456	November 29, 2007
Find maximum value from threads CUDA Programming and Performance	6	591	December 16, 2023
Cumpute Max of Vector or Matrix CUDA Programming and Performance	7	3846	June 6, 2011
Max() function? CUDA Programming and Performance	1	4000	April 29, 2009

Reduction question, searching for the lane index of the thread with the highest value

Related topics