Branchless Bitonic Sort with float keys

I already have a bitonic sort implementation for integer keys, based on the GPU Technology Conference presentation (Kepler) Shuffle: Tips and Tricks.
The problem at hand is sorting a large number (~2 million) of small sets (128-256 elements each). I would like to implement this with each thread sorting one set at a time, while obviously avoiding thread divergence.
How can I achieve this when the data sets have float keys (so the bit tricks of the integer version no longer apply)?

First of all, if you are asking how to do a sort in a single thread, that is mostly not a CUDA programming question, and I wouldn't recommend that approach anyway. (For single-thread sorting of the type you are asking about, I would investigate sorting networks.) The only CUDA intersection would probably have to do with data organization (e.g. transpose your input matrix so data sets are columnar in memory).
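
For illustration only, here is a minimal sketch of the sorting-network idea within a single thread: a fixed sequence of compare-swap steps with no data-dependent branching, operating on float keys directly via fminf/fmaxf. The 4-key network and the helper names are my own example (ignoring NaNs), not code from the presentation mentioned above:

__device__ void cswap (float &a, float &b)
{
    // branchless compare-swap: afterwards a <= b (assuming neither key is NaN)
    float lo = fminf (a, b);
    float hi = fmaxf (a, b);
    a = lo;
    b = hi;
}

__device__ void sort4 (float k[4])
{
    // optimal 5-comparator sorting network for 4 keys
    cswap (k[0], k[1]); cswap (k[2], k[3]);
    cswap (k[0], k[2]); cswap (k[1], k[3]);
    cswap (k[1], k[2]);
}

Since the comparator sequence is fixed and fminf/fmaxf map to single instructions, there is no data-dependent branching in such a network.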

My suggestion would be to sort each set using at least a warp, and possibly a thread block (128-256 threads). There are a variety of benefits to this approach, including naturally getting some benefit from coalescing, as well as collective thread behavior (e.g. use of warp collective functions, shared memory, etc.). Before going any further, if it were me, I would benchmark against a CUB block sort per set.
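
As a baseline for such a benchmark, a block-per-set CUB sort could look roughly like the sketch below. The kernel name, the blocked load/store pattern, and the assumption that the sets are stored contiguously at a fixed stride are my own choices, not something from the question:

#include <cub/cub.cuh>

// one thread block sorts one set of BLOCK_THREADS * ITEMS_PER_THREAD float keys
template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void sort_sets (float *keys, int set_stride)
{
    using BlockRadixSort = cub::BlockRadixSort<float, BLOCK_THREADS, ITEMS_PER_THREAD>;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    float *set = keys + blockIdx.x * set_stride;

    // blocked arrangement: thread t owns items [t*ITEMS_PER_THREAD, (t+1)*ITEMS_PER_THREAD)
    float thread_keys[ITEMS_PER_THREAD];
    for (int i = 0; i < ITEMS_PER_THREAD; i++)
        thread_keys[i] = set[threadIdx.x * ITEMS_PER_THREAD + i];

    BlockRadixSort (temp_storage).Sort (thread_keys);

    for (int i = 0; i < ITEMS_PER_THREAD; i++)
        set[threadIdx.x * ITEMS_PER_THREAD + i] = thread_keys[i];
}

For 256-key sets this could be launched, for example, as sort_sets<128, 2><<<num_sets, 128>>>(d_keys, 256).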

Also: AFAIK, there are ways to order floating-point values using integer-style arithmetic, if the job is done carefully, so I'm not sure which bit tricks of the integer version you are referring to. For example, AFAIK a typical radix-sort implementation can be applied to float values using the traditional bitwise approach, if you wish.
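
To make that concrete, the well-known transformation (my sketch, not code from any linked material) maps each float key to an unsigned integer whose ordering matches the floating-point ordering, so that an ordinary integer radix sort can be applied. Assuming no NaNs are present, something like:

// map a float to an unsigned int with the same relative ordering (assuming no NaNs):
// flip all bits of negative values, flip only the sign bit of non-negative values
__device__ unsigned int float_to_ordered_bits (float f)
{
    unsigned int u = __float_as_uint (f);
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// inverse transformation, to recover the float key after sorting
__device__ float ordered_bits_to_float (unsigned int u)
{
    return __uint_as_float ((u & 0x80000000u) ? (u ^ 0x80000000u) : ~u);
}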

Note that the accepted answer at the Stack Overflow question linked by Robert Crovella above dispenses poor advice. Reinterpreting two float operands as unsigned 32-bit integers and then comparing them with an integer comparison will not work correctly for arbitrary numeric floating-point operands.

Assume we want to compare 1.0f and -1.0f this way. They map to 0x3f800000 and 0xbf800000, respectively. When using an unsigned comparison, clearly 0x3f800000 < 0xbf800000, which would mean that 1.0f < -1.0f, which is obviously not the desired result.

Functionally correct solutions would be slightly more involved. For example, if we can exclude the presence of NaN (not-a-number) encodings with certainty, we could read the float data as int (that is, as signed 32-bit integers) and then implement a float-style “less than” comparison like this:

__device__ int lt_fp32 (int ix, int iy)
{
    // when both operands are negative, the ordering of their integer bit patterns is reversed
    int r = ((ix & iy) & 0x80000000) ? (iy < ix) : (ix < iy);
    // treat -0.0f (0x80000000) and +0.0f (0x00000000) as equal
    if ((ix | iy) == 0x80000000) r = 0;
    return r;
}

I am not at all convinced that using such bit-trickery on float data is necessary to implement a bitonic sort, but maybe that is the case in very specific contexts.
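
For instance, the compare-exchange step of a shuffle-based bitonic sort can operate on float keys directly, without reinterpreting them as integers at all. A minimal sketch (the function name and the assumption that each warp handles a 32-element subsequence are mine):

// one bitonic compare-exchange step across a warp, on float keys, without bit tricks:
// each lane exchanges its key with the lane whose index differs in the bits of 'mask'
__device__ float bitonic_exchange (float key, int mask, bool ascending)
{
    float partner = __shfl_xor_sync (0xffffffffu, key, mask);
    // the lower lane of each pair keeps the minimum when sorting ascending
    bool keep_min = (((threadIdx.x & mask) == 0) == ascending);
    return keep_min ? fminf (key, partner) : fmaxf (key, partner);
}

In a full sort, ascending would be derived from the higher bits of the lane index at each bitonic stage, just as in the integer version referenced in the question.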
