most of the implementations for sorting numbers use a comparison, followed by a conditional swap of the compared values. Sorting networks traditionally use this method and they can nicely be parallelized.
I am currently looking into GPGPU (using shaders) in addition to CUDA, and I was wondering if anyone knows techniques for conditionally swapping two variables
possibly without using extra temporary registers and
without using any branching or predication
This would be used in optimization problems, e.g. determining which color channel (taken from one or more textures or data arrays) provides the “best” (i.e. highest) value. The color channels could represent engineering data, e.g. signal strength received from multiple sources.
An implementation of this would likely result in speed-ups using CUDA, as well as make it possible to do it in branchless shader code.
Here is the trick to work without an intermediate register.
Swapping variables a, b
a = a + b
b = a - b
a = a - b
It works for ints, if overflow is correctly handled.
It works for floats, however if a is large or b is small, there will be precision problems.
Now the question remains how to make it conditional and branchless. I was thinking about using a multiplication factors in front of some expression that determines whether a swap will take place or not.
For shader code, I was just thinking that I could use a LRP (linear interpolation) between a and b and b and a respectively… If I set the lerp factor to 0, I get a - if I set it to 1 I get b. A previous comparison will determine the lerp factor either as 0 or 1. I may need to use a temporary register with this approach.
Compare and swap is a fundamental instruction in PTX. I haven’t actually looked into how this is actually handled by the nvidia compiler, but it would be possible to compile the statement D = ( A < B ) ? C : E; into the following instruction sequence:
SET.lt.s32.s32 temp, A, B;
SLCT.s32.s32 D, C, E, temp;
Which has no branches…
You use a temp register, but who cares? The live range of the register is only those two instructions so an intelligent register allocator will reuse it immediately.