cuda integer operations and simt for sorting

WIth cudas simt the idea is that a single instruction is run in parallel on a bunch of cores.
Is integer division a “non branching” operation in the sense that it can be run in simultaneously in an entire warp without causing branching?
How about type casting?
How about bit manipulation
The idea is that if it is one can use cuda to do parallel sorting of positive integers as follows:
put integer a in memory slot 0
put integer b in memory slot 1
int c = ( a-b );
int d = c & 0x7fffffff; // d is now a positive version of c
int index = ((c/d) + 1 )/2; (index is 1 or 0 depending on whether a is larger or b is larger, note that this particular method requires a and b to be distinct values)
int min = member at memory slot index
int max = member at memory slot 1 - index
a = min
b = max // now a is the smaller and b is the larger of the original two

I am sure there is a slicker version than this that doesnt require distinctness, possibly using something like
index = (a/b); // at this point if a is larger we have a positive, otherwise it is zero
index = (3index)/(3index - 2); // if index was 0 it is still 0, if it was any other positive integer it is now 1
int max = member at memory slot index
int min = member at memory slot 1 - index
a = min
b = max

Anyway, the idea is that one should be able to do very fast sorting in cuda if one can simply put two elements in order without causing warps to split (each thread putting its two in order).
Effectively a very fast bubble sort is a minimal possibility (first thread works on elements 1,2 second thread works on elements 3,4, … on next run 1st thread works on elements 2,3 next thread on elements 3,4, … on next run
1st thread works on elements 1,2 next thread on elements 3,4 …

A shell sort based on such a bubble sort is another possibility.

First off, Will this work (without branching, that is)?

Secondly, any suggestions on fast sorting using cuda? It should be possible to sort VERY fast, but I am fairly new to cuda.

First, a little branch or conditional execution is not an issue with CUDA. And anyway faster than using int32 multiplications or worse divisions!!!

Second, the problem of sorting large array in CUDA is not a matter of branching but -mainly- a matter of memory latencies, each time you read from Global Memory your threads may be blocked for 200 to 600 cycles, and even if it’s better when writing you will hit a penalty too.
Anyway CUDA-devices Global Memory is far slower from CPU CACHED memory when the CPU cache have a good hit ratio (so for problems in the order of MBs to 10-20MB).

There’s a bitonic sort on nVidia’s CUDA example (w/ source-code) : CUDA Bitonic Sort

And many other great sort algorithms (mainly Radix Sort derived) that you will find on the forum here :-)

Interesting. Thanks for pointing out the other sort sources. No, I didn’t realize that integer division took so many clock cycles. Thanks for the heads up. Yes, the limitation of shared memory size does mean moving things

around (Am I correct that memory is only shared within a single multiprocessor, and the only way to move it around from one shared memory to another within a kernel is to copy it to global memory and back?)

For sorting large arrays, I recommend you use the radix sort included in the CUDA SDK (2.2 and later) and described in our paper available here:



Sadly, it’s the only way to communicate between different multiprocessor (in fact there’s another if you have MCP79 IGP or GT200 GPU: main computer Mapped Pinned Memory, but it’s worse! lol!) : copy data to global memory.

Your reply to my original sorting post was quite helpful. If the name of the game is latencies, where can one get an idea of what the real latency rules are. I don’t think there was any indication in the guide about the cost of integer multiplication or division. How would I have been able to find that out myself? How would I find out what all the latency factors are and precisely how severe they are? I have heard writing highly optimized C code described as being like chess. Where would I find the “rules of the game” for fast cuda above and beyond what is in the guide (i.e. at the level you are talking about)?


I will have to take a careful look at that. Since you are publishing the algorithm I need to ask: Is the algorithm headed for a patent or is it openly available for use or something in between those?

There’s a great document that sum up many best practices for efficient CUDA programming, you will find it on thread: General CUDA GPU Computing discussion, it’s CUDA Best Practices Guide 2.3 (PDF) and it mainly covers optimizing for CUDA.

Instruction latencies are on the other documentations, but the worst is the Global Memory latency, and it depend on many factors: kinda memory, frequency, and how-much parallel memory accesses are done. Anyway it’s between 200 and 600 GPU cycles depending on these factors, so you are more than an order of magnitude over any other operation!

You could hide multiplication or division latencies launching 6 warp or more per Block (means 192+ threads per Multi Processor), but Global Memory accesses (or “local”) couldn’t be hide that way. You may consult this post on “Macro-threading” for CUDA GPU, that explain how to overlap Global Memory IO while using 100% GPU Cycles.