WIth cudas simt the idea is that a single instruction is run in parallel on a bunch of cores.

Is integer division a “non branching” operation in the sense that it can be run in simultaneously in an entire warp without causing branching?

How about type casting?

How about bit manipulation

The idea is that if it is one can use cuda to do parallel sorting of positive integers as follows:

put integer a in memory slot 0

put integer b in memory slot 1

int c = ( a-b );

int d = c & 0x7fffffff; // d is now a positive version of c

int index = ((c/d) + 1 )/2; (index is 1 or 0 depending on whether a is larger or b is larger, note that this particular method requires a and b to be distinct values)

int min = member at memory slot index

int max = member at memory slot 1 - index

a = min

b = max // now a is the smaller and b is the larger of the original two

I am sure there is a slicker version than this that doesnt require distinctness, possibly using something like

index = (a/b); // at this point if a is larger we have a positive, otherwise it is zero

index = (3*index)/(3*index - 2); // if index was 0 it is still 0, if it was any other positive integer it is now 1

int max = member at memory slot index

int min = member at memory slot 1 - index

a = min

b = max

Anyway, the idea is that one should be able to do very fast sorting in cuda if one can simply put two elements in order without causing warps to split (each thread putting its two in order).

Effectively a very fast bubble sort is a minimal possibility (first thread works on elements 1,2 second thread works on elements 3,4, … on next run 1st thread works on elements 2,3 next thread on elements 3,4, … on next run

1st thread works on elements 1,2 next thread on elements 3,4 …

A shell sort based on such a bubble sort is another possibility.

First off, Will this work (without branching, that is)?

Secondly, any suggestions on fast sorting using cuda? It should be possible to sort VERY fast, but I am fairly new to cuda.