GPU sort


I need to sort array of ~1 milion floats, this array is already on,
comparator is ‘>’
currently i’m using bitonic sort to sort 512 floats subarrays, then i’m using
‘merge’ sort to merge subarrays over and over again
(i’m caching both reads & writes with shared memory)
The average time of executing all the kernels (+ reading last element to host memory to be sure that the kernel has finished) is 0.37 sec. on my 8800 GT.

Can i do it better ? Are there better algorithms to sort on GPU ?
Thanks for any ideas :)

That’s pretty good already, considering that an odd-even sort on 1 Million floats takes 40s+ on the GPU, even after heavy optimization ;)

Did you compare this with quicksort on the CPU at all?

Well the CPU version takes about ~0.1 second on one core of core2 quad CPU.

Thats why i’m seeking more efficient GPU sort because the sorted data is then subsequently used by GPU in another kernels and cudaMemcpy’ing forth and back
is not a vise decision here.

The (modified) radix sort from particle example is probably not a solution here
because i’m sorting floats :/

Did you try cudpp?

Yes, the cudpp ‘sort’ is slightly slower than my own at the moment (about 1 or 2%)

One more question slightly out of topic (but related in my case).

when i have one chunk of GPU memory, and i’v mapped it as a texture (1d)

then in kernel i’m reading from this texture, processing data, then writing to

that same chunk of memory (i’m avare here that the stalle data will be in texture cache now)

Then in another kernel i’m sampling from that same texture, do the texture cache

gets invalidated automatically and no more stalle data is in the cache ? or do i need

to do some magic to flush the texture cache (probably rebinding the texture will be enough ?)

As far as I know the flushing happens between successive kernel calls.

Note that is is possible to modify radix sort to sort floating point numbers:

(You can ignore Google’s warning on this page, btw).

We should have an updated radix sort in the next SDK release.