Branchless sorting possible?


most of the implementations for sorting numbers use a comparison, followed by a conditional swap of the compared values. Sorting networks traditionally use this method and they can nicely be parallelized.

I am currently looking into GPGPU (using shaders) in addition to CUDA, and I was wondering if anyone knows techniques for conditionally swapping two variables

  1. possibly without using extra temporary registers and
  2. without using any branching or predication

This would be used in optimization problems, e.g. determining which color channel (taken from one or more textures or data arrays) provides the “best” (i.e. highest) value. The color channels could represent engineering data, e.g. signal strength received from multiple sources.

An implementation of this would likely result in speed-ups using CUDA, as well as make it possible to do it in branchless shader code.


Here is the trick to work without an intermediate register.

Swapping variables a, b

a = a + b
b = a - b
a = a - b

It works for ints, if overflow is correctly handled.

It works for floats, however if a is large or b is small, there will be precision problems.

Now the question remains how to make it conditional and branchless. I was thinking about using a multiplication factors in front of some expression that determines whether a swap will take place or not.

You can use min and max functions to get rid of the conditionals.

But I can’t think of how to do that in-place without using additional memory for the intermediate value.

For shader code, I was just thinking that I could use a LRP (linear interpolation) between a and b and b and a respectively… If I set the lerp factor to 0, I get a - if I set it to 1 I get b. A previous comparison will determine the lerp factor either as 0 or 1. I may need to use a temporary register with this approach.


Using bit-ops might get you around the precision & overflow problems?…ingValuesSubAdd

Oh that’s one cool collection of hacks on the web site. Thanks for sharing.

When using the _float_as_int() trick this could also work for floats without losing precision. Hmm…


Yeah, I think it will be a very fast way to save a register (but only if you do not need that extra register later on in your kernel ;))

Compare and swap is a fundamental instruction in PTX. I haven’t actually looked into how this is actually handled by the nvidia compiler, but it would be possible to compile the statement D = ( A < B ) ? C : E; into the following instruction sequence: temp, A, B;
SLCT.s32.s32 D, C, E, temp;

Which has no branches…

You use a temp register, but who cares? The live range of the register is only those two instructions so an intelligent register allocator will reuse it immediately.

Also, if you are interested in sorting using CUDA, read these papers: