Fastest Sorting on CUDA?

I have tried several sorting algorithm on CUDA, but the result is not that good I expected.

I sort 16M numbers on a 9800GT GPU and Q6600 Intel CPU

Radix Sorting from NParticles in CUDA SDK: 555ms
CUDPP Global Radix Sort: : 566ms
The C qsort : 2200ms
C++ STL : 1141ms

Does anybody knows faster algorithm than the Radix Sorting by Scott Le Grand.

I think the current fastest is:

An implementation of this is included in the “smokeParticles” sample in the CUDA 2.1 SDK.

So Cool this implementation , Thank you very much.

There is also this paper on a form of Quicksort optimized for the GPU (they test the 8800 GTX and 8600GTS):

Quick update: Michael Garland tells me that he has just published a pre-print of their IPDPS paper here, which has a better explanation of the algorithm and more recent performance data:

Thank you Simon.

In the paper I also noticed a method by Intel guys, and GTX200 is on average 23% faster than the multicore CPU merge sort. But if they test their method on Intel 8 Core Xeon CPU, the CPU method may beat the GPU method.