Best sorting algorithm using CUDA Sorting networks, merge or radix sort?

For a CUDA application I’m in need for a fast sorting algorithm to sort coordinates lexicographically. I know CUDPP provides a sorter, in the form of a RADIX sort combined with a merge sort.

Previous GPGPU sorters were generally based on sorting networks, which have complexity Nlog^2 N and quite some overhead.

What complexity has the CUDPP sorter? What are the advantages and disadvantages compared to a sorting network?

The best sorting algorithm depends somewhat on your application - whether you need to sort key/value pairs or just keys, the type of your data, whether you can sort incrementally etc. etc.

I believe the fastest published CUDA sort for 32-bit integers is Scott LeGrand’s radix sort, which is described in his GPU Gems 3 article (chapter 32):

The code for a version of this is included in the particles demo in the SDK.