I would implement it this way:
max_f = warp_butterfly_reduction_max_f32(f)
mask = __ballot(f == max_f)
lane = bfind.u32(mask)
That’s 13 ops: (shfl + max) x 5 + setp + vote.ballot + bfind.u32.
This will return the highest lane index in case there are multiple matches.
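For anyone who wants to try this from CUDA C++ rather than PTX, here is a rough sketch of the same three steps. It assumes a full warp of 32 active lanes and the CUDA 9+ `*_sync` intrinsics (my substitution for the older `__ballot` above). The names `warp_argmax_lane`, `argmax_kernel` and `warp_argmax_host` are made up for illustration, and `31 - __clz(mask)` is standing in for `bfind.u32`; I haven't checked what SASS the compiler actually emits for it.

```cuda
#include <cuda_runtime.h>

// Butterfly (XOR) max reduction: after 5 shfl/max steps every lane holds
// the warp-wide maximum. Assumes all 32 lanes are active.
__device__ float warp_butterfly_reduction_max_f32(float f)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        f = fmaxf(f, __shfl_xor_sync(0xffffffffu, f, offset));
    return f;
}

// Hypothetical helper: returns the highest lane index holding the max.
__device__ int warp_argmax_lane(float f)
{
    float max_f = warp_butterfly_reduction_max_f32(f);

    // One bit per lane whose value equals the warp max.
    unsigned mask = __ballot_sync(0xffffffffu, f == max_f);

    // Index of the most significant set bit. For mask == 0 this yields -1,
    // the same bit pattern bfind.u32 returns (0xffffffff) when no bit is set.
    return 31 - __clz(mask);
}

// Hypothetical test kernel: one warp, lane 0 writes the result.
__global__ void argmax_kernel(const float* in, int* out)
{
    int lane = warp_argmax_lane(in[threadIdx.x]);
    if (threadIdx.x == 0)
        *out = lane;
}

// Hypothetical host wrapper (no error checking, for brevity):
// takes 32 floats and returns the winning lane index.
int warp_argmax_host(const float* h_in)
{
    float* d_in = nullptr;
    int* d_out = nullptr;
    int h_out = -2;

    cudaMalloc(&d_in, 32 * sizeof(float));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);

    argmax_kernel<<<1, 32>>>(d_in, d_out);

    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

If you need the lowest matching lane instead, `__ffs(mask) - 1` on the same ballot mask should do it.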
You might want to convince yourself that your max.f32 (or max.ftz.f32) and setp.eq.f32 (or setp.eq.ftz.f32) behave consistently on your input data. @njuffa can probably describe a situation where the reduction result wouldn’t match any lane’s value.
I was thinking about NaNs and/or accidentally mixing .ftz and non-.ftz instructions. I think there is nothing to worry about if you don’t do anything dumb in PTX. :)