This library is distributed as source code: include the header “bbsort.cuh” in your project and call the bbsort() functions from your .c/.cpp/.cu files.
You may also need to make some changes to adapt the code to your application. Here are some tips.
1. To sort float, int, uint, short, or ushort values, you only need to change “bbsort.cu” near line 300.
Please follow the images in sort_float-int-short-double_setups.
2. To sort structures, you have to make four changes across “bbsort.cu”, “bbsort.cuh”, and “bbsort_kernel.cu”.
Please follow the images in sort_structure_setups.
3. To sort doubles, after making change 1, you also need to add -arch=sm_13 to your custom build command.
Your card must also have compute capability 1.3.
The file gaussian_table.dat can be used to generate a Gaussian-distributed list.
Great work! But I have some problems when I try to compile the example for floats in VS2005:
1>bbsort.cu(197) : error C2065: ‘assignElementToSlicesNearlySortedD’ : identifier not found
1>bbsort.cu(200) : error C2065: ‘assignElementToSlicesD’ : identifier not found
1>bbsort.cu(241) : error C2065: ‘assignElementToBucketD’ : identifier not found
1>bbsort.cu(245) : error C2065: ‘bitonicSortD’ : identifier not found
What am I doing wrong? If you find the solution, please tell me. Thanks a lot.
Hey, this is terrific! Thanks so much for sharing and documenting so well.
Sorting always seems to be a boring problem, but in practice it’s quite fascinating with interesting techniques and bottlenecks, and many strategies to explore.
A couple questions:
You compare against GPUquicksort and GPUradixsort, but you just list these names without any reference to what they are… are they the radix sort from CUDPP?
And have you compared your sorts against the work from Satish, Harris, and Bell? They’re getting about 200M key-only sorts/second on a GTX 280, and about 140M key/value sorts/sec.
Their code is available, though tucked into the CUDA 2.2 SDK in the smokeParticles project.
I’d also love to see those speed graphs on a log scale… especially for smaller N like 5K, 10K, 50K, 100K. That’s a domain where radix sorting doesn’t do so well, and implementation overhead dominates. Yet 50K is a very common sorting size, especially for applications like real-time SPH.
From the graphs, it’s clear that your sort has reached a speed equilibrium by N=4M, but we can’t see any performance details in the interesting range below that.
Also, for HUGE sorts of, say, 100M, do you start seeing O(N log N) behavior or even O(N^2)? The initial linear bucketing steps feel like they’ll start losing effectiveness as more and more buckets get lopsided counts, and that may cause more inefficiency at even larger N. Your linear graphs don’t show any efficiency decay even up to N=24M, so maybe this effect is effectively hidden.
It would be easier to see if the graphs were all showing keys/sec for different N, instead of total time for each N.
Hi,
I understand that you can sort 24 million numbers (float4, int4) within half a second on a GeForce GTX 280 (Fig. 3(b)).
You mention on the same graph that the STL (Standard Template Library, Microsoft’s?) quicksort would need 8 seconds.
So 0.5 sec on a GPU vs 8 sec on a CPU (1 core of a 2.4 GHz Intel Q6600).
Is this right?
The STL quicksort probably uses recursion and function pointers.
This is expensive. If you used a quicksort without them,
you would see that sorting 24 million numbers takes less than 2 seconds on one core of a Q6600.
One might achieve 0.5 secs on 4 cores, or somewhat more.
So where is the efficiency, given that a GTX 280 costs about double what an Intel Q6600 does?
Err, don’t forget that GPUs have an order of magnitude more bandwidth than CPUs (GTX 285 ≈ 160 GB/sec, Core 2 ≈ 10 GB/sec, i7 ≈ 25 GB/sec). There are lots of applications which use little compute but are memory-limited, and the GPU does great because of that bandwidth. And sorting is exactly such an application, one where GPUs destroy CPUs.
Read the Satish paper: the GTX 280 can sort roughly twice as fast as a quad-core CPU running Chhugani’s optimized code.
What CPUs excel at are cases where single-threaded speed is paramount: they have higher clocks, speculative execution, parallel operations, and everything else they can do to make linear code go fast. CPUs also have a very different memory-cache structure, which can be very effective for some applications, like compression, where the compute needs more local state… compare the 16 KB shared memory of a CUDA SM to a 2 MB CPU L2 cache (or larger, counting L3), plus the CPU’s smooth and efficient buffering of global memory using that cache transparently.
The bandwidths you quoted are only for reads. Writes to global memory in CUDA are extremely slow, and sorting requires a lot of writes. BTW, 2x is not a good speed-up.
I’ve measured write bandwidth of ~70 GB/s compared to ~130 GB/s read bandwidth on a GTX 280 (theoretical peak is 141.7 GB/s). I wouldn’t call that “extremely slow”.
Sorting is an interesting application because it’s a case where having a few MB of cache at your disposal boosts your effective memory bandwidth by a significant multiple. With only 16KB of shared memory per SM, GPUs must operate on smaller chunks of data and therefore perform more memory transactions per data element.
So it’s not “extremely slow writes” that limits GPU sorting performance, it’s that GPUs have much smaller local stores than CPUs and therefore consume more memory bandwidth per data element. This eats away much of the GPU’s raw memory bandwidth advantage over CPU systems.
Thanks for generously sharing. 16-core and 32-core CPUs will be released in the near future, so continuing your work on CPU sorting may be better than pursuing GPU sorting. I cannot believe there exist GPU algorithms able to sort millions of samples in milliseconds.
You underestimate the speed of NVIDIA GPUs for sorting. The Satish et al. sorting code included in the CUDA 2.2 SDK’s smokeParticles project sorts two million key+value pairs in 16 milliseconds on a GTX 280. You can compile and test it on your own machine today. Though ZeZe’s wish of sorting 24M key+data values in 2 milliseconds is likely a while off for anybody: even a trivial no-op stream through 24M values one time would use a memory bandwidth of 24M × 8 bytes / 2 ms = 96 GB/sec!
On the CPU side, algorithms do indeed evolve to handle more cores. The Chhugani results are really, really impressive. Chhugani’s method doesn’t scale linearly with cores, though: for sorting 2M keys on the CPU, 4 cores is only about 3.1 times faster than 1 core. When we see 8- and 16-core CPUs they will likely sort very quickly, but by that time we’ll also have another couple generations of GPUs, which will be even faster.
Where CPUs can be faster than GPUs for sorting is in smaller sorts, less than 1M points or so. It’s really interesting comparing the best known CPU sorting with the best known GPU sorting. Look at the Chhugani paper and Satish papers, they’re both quite readable.
Now if only Intel would let Chhugani release his CPU code…
It gets a bit confusing since there’s a couple conversations happening at once. I was replying to cvnguyen’s comment that he could not believe a GPU could sort millions of items in milliseconds, so I pointed him at the reference and code where it’s currently done.
As for your scaling comment, sorting integers or floats does not need to take O(N log N) time. There are many sort methods which are O(N), exploiting the ability to interpret keys numerically rather than only compare them pairwise. Satish’s code is a radix sort, which is O(N), so his sort time grows linearly in N beyond about N=1M.
Chhugani’s CPU sort is a very efficient O(N log N) sort. You can see both scaling effects in Satish’s paper: the Satish sort levels off at a constant rate of keys/second, while Chhugani’s decays. From Satish’s paper, his GPU sort will key-only sort at a rate of about 180M keys/sec for any N from 1M to 100M. Chhugani’s CPU sort runs at almost exactly the same 180M keys/sec at N=1M, but it slows to about 130M/sec at N=100M due to that log N effect.
Indeed, the OP’s code is not the fastest sort available. But it’s great that he’s sharing, especially since he took a different strategy than many other sorts.
It is indeed worthwhile to publish work, even if it doesn’t beat other code, just because it’s interesting and useful as a comparison. The OP didn’t claim he had the fastest GPU sort, he was just sharing his code and paper for one sorting approach.
If you’re interested in other GPU sorts, there’s a nice alternative radix sorter in GPU Gems 3. CUDPP also has a radix sorter, though it will likely be replaced with the Satish version (Mark Harris is part of both implementations). Satish et al. also implemented an N log N merge sort on the GPU, but it is slower than the radix sort for all N. (They’re very, very similar in speed for small N like 50K, though.)
Michael Garland has some great papers, too. He’s a coauthor of the Satish et al paper.
You can already see that fast GPU sorts are useful and practical: the fast ray-tracing BVH code uses fast sorting as a major component of its algorithm.
The GPU Gems 3 sort was implemented for broad phase object collision detection. The smokeParticles SDK example uses it for particle neighbor binning.