Branchless sorting possible?

cbuchner1 · January 23, 2009, 1:17pm

Hi,

most of the implementations for sorting numbers use a comparison, followed by a conditional swap of the compared values. Sorting networks traditionally use this method and they can nicely be parallelized.

I am currently looking into GPGPU (using shaders) in addition to CUDA, and I was wondering if anyone knows techniques for conditionally swapping two variables

possibly without using extra temporary registers and
without using any branching or predication

This would be used in optimization problems, e.g. determining which color channel (taken from one or more textures or data arrays) provides the “best” (i.e. highest) value. The color channels could represent engineering data, e.g. signal strength received from multiple sources.

An implementation of this would likely result in speed-ups using CUDA, as well as make it possible to do it in branchless shader code.

Christian

cbuchner1 · January 23, 2009, 1:34pm

Here is the trick to work without an intermediate register.

Swapping variables a, b

a = a + b
b = a - b
a = a - b

It works for ints, if overflow is correctly handled.

It works for floats, however if a is large or b is small, there will be precision problems.

Now the question remains how to make it conditional and branchless. I was thinking about using a multiplication factors in front of some expression that determines whether a swap will take place or not.

mborgerd · January 23, 2009, 4:20pm

You can use min and max functions to get rid of the conditionals.

But I can’t think of how to do that in-place without using additional memory for the intermediate value.

cbuchner1 · January 23, 2009, 4:55pm

For shader code, I was just thinking that I could use a LRP (linear interpolation) between a and b and b and a respectively… If I set the lerp factor to 0, I get a - if I set it to 1 I get b. A previous comparison will determine the lerp factor either as 0 or 1. I may need to use a temporary register with this approach.

Christian

E.D_Riedijk · January 23, 2009, 8:58pm

Using bit-ops might get you around the precision & overflow problems? http://graphics.stanford.edu/~seander/bith…ingValuesSubAdd

cbuchner1 · January 23, 2009, 9:11pm

Oh that’s one cool collection of hacks on the web site. Thanks for sharing.

When using the _float_as_int() trick this could also work for floats without losing precision. Hmm…

Christian

E.D_Riedijk · January 23, 2009, 11:10pm

Yeah, I think it will be a very fast way to save a register (but only if you do not need that extra register later on in your kernel ;))

Gregory_Diamos · January 24, 2009, 4:31pm

Compare and swap is a fundamental instruction in PTX. I haven’t actually looked into how this is actually handled by the nvidia compiler, but it would be possible to compile the statement D = ( A < B ) ? C : E; into the following instruction sequence:

SET.lt.s32.s32 temp, A, B;
SLCT.s32.s32 D, C, E, temp;

Which has no branches…

You use a temp register, but who cares? The live range of the register is only those two instructions so an intelligent register allocator will reuse it immediately.

Gregory_Diamos · January 24, 2009, 5:37pm

Also, if you are interested in sorting using CUDA, read these papers:

Topic		Replies	Views
branchless exchange based on condition ? CUDA Programming and Performance	1	1039	February 9, 2009
Fastest way to swap floats and integers? also looking for conditional swaps CUDA Programming and Performance	2	5894	July 18, 2008
how to avoid branchings in kernels? CUDA Programming and Performance	0	531	May 24, 2013
Is there efficient way to deal with if/else in the kernel CUDA Programming and Performance	4	14291	June 14, 2009
Avoiding branching with an arithmetic trick... CUDA Programming and Performance	14	4871	March 9, 2013
cuda integer operations and simt for sorting CUDA Programming and Performance	7	8972	July 25, 2009
No conditional statements Ternary operator or bit twiddling or predicate instructions? CUDA Programming and Performance	7	2040	June 15, 2011
WARP Level Sorting Efficient? CUDA Programming and Performance	8	2711	July 3, 2010
switch construction from C compiled to sequential if elseif elseif ... CUDA Programming and Performance	11	8318	August 1, 2008
Overhead of warp divergence vs. extra multiplication by 0 or 1 CUDA Programming and Performance	9	2356	February 27, 2013

Branchless sorting possible?

Related topics