Code: efficient sorting with CUDA

Click to download:
The package includes source, examples, a paper published and a brief introduction.

This library is distributed as original code, you have to include the header “bbsort.cuh” in your project,
and call the bbsort() functions in your c/cpp/cu files.

You may also have to make some changes to customize the code for your applications. Following are some tips.

1, if you want to sort float, int, uint, short, ushort, you have only to change “” near line 300
Please follow the images contained in sort_float-int-short-double_setups.

2, if you want to sort structures, you have make 4 changes in “”, “bbsort.cuh” and “”
Please follow the images contained in sort_structure_setups.

3, if you want to sort doubles, after having the change 1, you still to add arch=sm_13 to your custom build command.
And you also should have the cards with computing 1.3 capbility.

the gaussian_table.dat can be used for generating gaussian distributed list.

Any question please contact:

Thank you, this is interesting

:rolleyes: , if you find any bugs, please report them to me:

Thank you!~

Nice Code!~


great work! but I have some problems when I try to compile the example for floats for example in VS2005:

1> : error C2065: ‘assignElementToSlicesNearlySortedD’ : identifier not found
1> : error C2065: ‘assignElementToSlicesD’ : identifier not found
1> : error C2065: ‘assignElementToBucketD’ : identifier not found
1> : error C2065: ‘bitonicSortD’ : identifier not found

what I’m doing wrong? if you find the solution please tell me, thanks a lot

Hey, this is terrific! Thanks so much for sharing and documenting so well.

Sorting always seems to be a boring problem, but in practice it’s quite fascinating with interesting techniques and bottlenecks, and many strategies to explore.

A couple questions:
You compare to GPUquicksort and GPUradixsort. But you just list these names without reference to what these are… are they the radix sort from CUDPP?

And have you compared your sorts against the work from Satish, Harris, and Bell? They’re getting about 200M key-only sorts/second on a GTX280 , about 140M key/value sorts/sec.
Their code is available, though tucked into the CUDA 2.2 SDK in the smokeParticles project.

I’d also love to see those speed graphs on a log scale… specially for smaller N like 5K, 10K, 50K, 100K. That’s a domain where radix sorting doesn’t do so well, and implementation overhead dominates. Yet 50K is a very common sorting size, especially for applications like real time SPH.

From the graphs, it’s clear that your sort has reached a speed equilibrium by N=4M, but we can’t see any performance details in the interesting range below that.

Also, for HUGE sorts of say 100M, do you start getting an O(NlogN) behavior or even O(N^2)? The initial linear bucketing steps feel like they’ll start losing effectiveness as more and more buckets get lopsided counts, and that may cause more inefficiency with even larger N. Your linear graphs don’t show any efficiency decay even up to N=24M so maybe this effect is effectively hidden.

It would be easier to see if the graphs were all showing keys/sec for different N, instead of total time for each N.

I understand that you can sort 24 million numbers (floats4, int4) within a half second on a GeoForce 280 GTX.
(FIG 3., (B)
You mention on the same graph that the STL (standard Template library, Microsoft?) quicksort
would need 8 secs.
so 0.5 sec on GPU vs 8 sec on a CPU (1 core on a 2.4GHZ Intel Q6600)
Is this right?
The STL Quicksort probably uses recursion and function pointers.
This is expensive. If you were using a quicksort without this,
you would see that a sort of 24 Million numbers would take less than 2 seconds, on one core of a Q6600.
One might achieve 0.5 secs on 4 cores, or somewhat more.
So where is the efficiency, since a 280 GTX costs about the double of a Intel Q6600.

Please give us the code able to do that.

Chhugani’s CPU sorting algorithm sorts about 50M/sec on one core, about 110M/sec on four cores of a 3.2Ghz Q9550 Core 2.

That’s a good paper although I cannot find the available C code.

Clearly, CPUs easily beat GPUs for those applications in which the small ratio of computational operations to memory accesses is small.

However, this does not mean CUDA sorting is useless. When sorting is just a part of a bigger algorithm, the story may change.

Err, don’t forget that GPUs have an order of magnitude more bandwidth than a CPU does. (285GTX ~= 160GB/sec, Core 2 ~= 10GB/sec, i7 ~=25GB/sec). There are lots of applications which use little compute, but are memory limited, and the GPU does great because of that bandwidth. And sorting is exactly such an application where GPUs destroy CPUs.

Read the Satish paper, the 280GTX can sort roughly twice as fast as a quad-core CPU using Chhugani’s optimized code.

What CPUs excel in are cases where single threaded speed is paramount (they have higher clocks, speculative execution, parallel operations, etc, everything they can do to get linear code to go fast). CPUs also have a very different memory cache structure which can be very effective for some applications like compression, where the compute needs more local state… compare the 16kb shared memory of a CUDA SM to a 2MB CPU L2 cache (or larger, counting L3), plus the CPU’s smooth and efficient buffering of global memory using that cache transparently.

You’ve agreed with me.

The bandwidths you told are only for reading. Writes to global memory in CUDA are extremely slow, and sorting requires a lot of writes. BTW, 2x is not a good speed-up.

I’ve measured write bandwidth of ~70 GB/s compared to ~130 GB/s read bandwidth on a GTX 280 (theoretical peak is 141.7 GB/s). I wouldn’t call that “extremely slow”.

Sorting is an interesting application because it’s a case where having a few MB of cache at your disposal boosts your effective memory bandwidth by a significant multiple. With only 16KB of shared memory per SM, GPUs must operate on smaller chunks of data and therefore perform more memory transactions per data element.

So it’s not “extremely slow writes” that limits GPU sorting performance, it’s that GPUs have much smaller local stores than CPUs and therefore consume more memory bandwidth per data element. This eats away much of the GPU’s raw memory bandwidth advantage over CPU systems.

Thanks for your correction.

Please find the attached nrquicksort.txt file, which you rename in nrquicksort.c.

The code I have derived from something I once have found by googling for

non recursive quicksort

After compiling with VC2008 (gnu works as well, but is the exe is slower)

you’ll see that it should sort 24 Mio DOUBLES in 2500 Milliseconds

on one CORE of a 2.4 GHZ Q6600.

The code has to be amended that it starts 4 threads on a quarter of the array each thread,

and then merges the partial Arrays to one again, in order make use of the 4 cores.

Then you’ll see about 800 Milliseconds, may be a little bit more.

This merge code is the same, that you need for GPU-sorting to merge, when the number of

elements would exceed the GPU-Memory. This would be in a range of 100 Million elements

I would expect a tremendous decrease

of the sorting speed for this published solution here anyway,

and this not by the CPU-GPU memory transfer only, but just by the applied algorithm.

I have to deal with 1-2 billion of elements.

Maybe the zerocopy functionality of the 1.3 CC would help the GPU-Sort re the memory transfer.

I have spent enough money on my 8800GTS 2.5 years ago.

I also have spent lot’s of time to achieve a significant speed gain on sorting

with the GPU, at least iternally in the GPU, not taken the memory transfer into account,

which blast any speed gain anyway.

Maybe Nvidia donates me a 260GTX with 1.7 GB with CC 1.3.

I still have to pay a mortage off and a child to grow up, but I have an idea,

basing on the Radix sort example in the SDK. Maybe in some months,

Here in the Forum there is an example for radix sorting (without shared device mem)

and it sorts 8 Million numbers in 0.02 sec.

I cannot find it anymore, but I have copied the kernel.

The Array has to be padded with high values to a

2^x… The merge is outstanding here as well,

global static void BitonicSortLarge (int * values, int j, int k){

unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

int ixj = i^j;

	if ((ixj)>i){

	if ((i&k)==0 && values[i]>values[ixj]) {


		int temp = values[i];

		values[i] = values[ixj];

		values[ixj] = temp;


	if ((i&k)!=0 && values[i]<values[ixj]){

		// exchange(i,ixj);

		int temp = values[i];

		values[i] = values[ixj];

		values[ixj] = temp;




In my CPU code the SOURCEARRAY is filled by rand().

A simply descending list in sorted within 700 Milliseconds to an ascending list.

(just replace if(1) by if(0) at the bottom.)

Other datatypes can be used by setting the #define ELEMTYPE

to the desired dataype.

The results in the sorted array is compared against an ARRAY sorted by the stdlib qsort, which takes about the

double/triple time, since qsort ( propably ) uses recursion and surely function pointers on the custom compare function.

If results are equal, success is printed, otherwise failure.

I would consider a GPU-solution ias interesting, when it sorts 24 Million

elements in less than 10 Milliseconds, best just 2-3.

Since i have to sort Billion of elements (they also have attributes, not just sort keys)

I need access to multiple disks via CUDA, I wonder how to achieve this via DMA,

right on the Disk blocks.
nrquicksort.txt (2.89 KB)

Thanks for your generous share. 16-core & 32-core CPUs will be released in near future, so continuing your work on CPU sorting may be better than considering GPU sorting. I cannot believe in the existence of such GPU algorithms able to sort millions samples in milliseconds.

You underestimate the speed of NVidia GPUs for sorting. The Satish et al sorting code included in the CUDA 2.2 SDK in the smokeParticles project sorts two million value+keys in 16 milliseconds on a 280GTX. You can compile and test that on your own machine today. Though ZeZe’s wish of sorting 24 M key+data values in 2 milliseconds is likely a while off for anybody. Even a trivial no-op stream through 24M values one time would use a memory bandwidth of 24M*8bytes/2ms = 96GB/sec!

On the CPU side, algorithms do indeed evolve to handle more cores. The Chhugani results are really, really impressive. Chhugani’s method doesn’t scale linearly with cores, so for sorting 2M keys on the CPU, 4 cores is only about 3.1 times faster than 1 core. When we see 8 and 16 core CPUs they will likely sort very quickly, but by that time we’ll also have another couple generations of GPUs which will be even faster.

Where CPUs can be faster than GPUs for sorting is in smaller sorts, less than 1M points or so. It’s really interesting comparing the best known CPU sorting with the best known GPU sorting. Look at the Chhugani paper and Satish papers, they’re both quite readable.

Now if only Intel would let Chhugani release his CPU code…

I dont’ understand this all here anymore.

The duration of sorting increases more than the number of elements to be sorted.

Double number of elements, 3 times longer, best case.

As far as I have understood, the OP claimed that his (our sort) can sort 24 Million numbers (int4, float, whatever)

in about 800 Milliseconds and titled this efficient.

I have asked if the OP has seriously uttered such a ratio, as it is reflects from his publication. No answer came.

If this really would be the case, it would not exceed the efficiency of a Q6600 and especially not of a Intel 920,

and 8, 16 core successors, it is not worth to publish such efforts, which are scientifically excellent, but in terms of efficiency

somewhat rediculous.

The cvnguyen asked about pieces of code, which demonstrates such speed for a CPU solution, and I posted it,

admitting that it is not really mine.

And now?.

Of course the (Nvidia) GPU is good for very much more, I don’t know how, but may be in some month.

Maybe other know as well. I pointed to a posting here in the forum which is quite promising.

(the bitonic kernel for 8 Million rows, it just does not use shared memory)

My advice is, Nvidias GPU will make it, against all odds , against CELL and against ATI,

but the folks must try harder. They should research more an publish less.

It gets a bit confusing since there’s a couple conversations happening at once. I was replying to cvnguyen’s comment that he could not believe a GPU could sort millions of items in milliseconds, so I pointed him at the reference and code where it’s currently done.

As for your scaling comment, sorting integers or floats does not need to take O(N log N) time. There are many sort methods which are O(N) by using the ability to compare keys numerically (not just by pairwise comparison). Satish’s code is a radix sort which is O(N) so his sort speed becomes linear with N after about N=1M.

Chhugani’s CPU sort is a very efficient O(N log N) sort. You can see both of these scaling effects in Satish’s paper, the Satish sort levels off at a constant rate of keys/second, Chhugani’s decays. From Satish’s paper, his GPU sort will key-only sort at a rate of about 180 keys/sec for any N from 1 to 100 M. Chhugani’s CPU sort will sort at a rate of almost exactly the same 180 keys/sec at N=1M, but it slows to about 130M/sec for N=100M due to that log N effect.

Indeed, the OP’s code is not the fastest sort available. But it’s great that he’s sharing, especially since he took a different strategy than many other sorts.

It is indeed worthwhile to publish work, even if it doesn’t beat other code, just because it’s interesting and useful as a comparison. The OP didn’t claim he had the fastest GPU sort, he was just sharing his code and paper for one sorting approach.

If you’re interested in other GPU sorts, there’s a nice alternative radix sorter in GPU Gems 3. CUDPPalso has a radix sorter, though it will likely be replaced with the Satish version (Mark Harris is part of both implmentations.) Satish et al also implemented an N log N merge sort on the GPU, but it is slower than the radix sort for all N. (They’re very very similar speeds for small N like 50K though.)

Michael Garland has some great papers, too. He’s a coauthor of the Satish et al paper.

You can see already that fast GPU sorts are useful and practical, the fast raytracing BVH code uses fast sorting as a major component of their algorithm.

The GPU Gems 3 sort was implemented for broad phase object collision detection. The smokeParticles SDK example uses it for particle neighbor binning.