Friday screenshot: HotSort v2 sorting algorithm

I had some major spilling issues with CUDA 7.x and my HotSort sorting library.

CUDA 7.5 didn’t resolve the spills and I couldn’t wait any longer… it was unfortunately time for a heavy rewrite. :(

Some of the work was done months ago but it took the last few weeks to get the higher-level kernels completed.

I also managed to generalize the implementation so it can run on architectures that aren’t as programmer-friendly as CUDA multiprocessors.

The generalized version now compiles cleanly on 7.x and the performance seems to match the original algorithm.

Here’s a snapshot showing unsigned 64-bit key throughput:

Gotta love those Maxwell SMMs. :)

Good Work!

Sorting on parallel architectures is a harder task.

Will this be an open-source project? If so can you provide a link?

If you have a test case I can benchmark on a Titan X.

Thanks!

Unfortunately, it’s not open-source but I figured it would be nice to show some plots.

The NVIDIA ModernGPU (merge) and CUB (radix) implementations are state of the art but HotSort dominates below ~2-4m 64-bit keys.

I’ve put it off for years, but at some point I’ll replace the higher level merging kernels with something better and the performance curve should move higher and flatten out when sorting 10’s of millions of keys.

It’s funny how just a few years ago there was serious debate over whether GPUs could efficiently beat CPU sorting algorithms. :)

This showed up today on hgpu;

https://docs.google.com/viewer?url=http://arxiv.org/pdf/1511.03404v1

Not really that great of a paper, but if I am interpreting their charts correctly your HotSort is far superior to the listed implementations.

I have always found that thrust::sort(), which I believe is a radix sort(for primitives) is very fast. How does HotSort compare to thrust::sort() for a large array(2^29) of 32 bit floats?

I just skimmed that paper a few minutes ago when it showed up on @hgpu!

I don’t have 32-bit results for HotSort v2 right now but the old results on the PIXEL I/O site should wind up being nearly the same.

HotSort crushes everyone (!) below 2-4m 64-bit keys. It has a similar performance curve for 32-bit keys.

But past that point, radix — especially with 32-bit keys — will continue to rise as HotSort continues to droop.

I think CUB/Thrust will reach a ludicrous 2B/1B 32-bit keys/sec on a GTX Titan.

When sorting a gigantic 2GB array of 32-bit keys, CUB and Thrust radix are going to win big… for now. :)

pretty simple radix sort implementation [1] sorts 32-bit keys at 90 MKeys/sec [2] on the single core of i7-4770. With a few more optimizations it should be 110 MKeys/sec. So, the 18-core monster should process ~1 GKeys/sec

[1] https://github.com/Bulat-Ziganshin/MT-LZ/blob/49cfbb7ddfa0306be081125c1421669921814d82/RadixSort.cpp#L96
[2] https://github.com/Bulat-Ziganshin/MT-LZ/blob/49cfbb7ddfa0306be081125c1421669921814d82/results.txt#L329

Please enlighten me: What are practical use cases of these fast radix sorts?

not sure what exactly you mean. radix sort may be preferred to other sorts in cases where it’s faster. 32-64 bit key is large enough for many usecases and additional 32-64 bit data can carry index of corresponding array element, so sorting will produce a sort of DB table index.

my personal usecases are Shindler Transform and LZ search. In both cases, we have 3-8 byte key, and 2-4 byte data. For LZ search those data bytes is just index in the dictionary, so sorting allows us to cluster together dictionary elements with the same first few bytes.

By “use case” I meant actual production code that includes this sort of radix sort (using 32-bit or 64-bit keys). I have 25 years of software optimization experience, and have worked with customers from diverse industries, yet have never come across any software that needed this. So I am curious where this would show up. Business analytics?

I will have to look up the Shindler Transform as I have never come across the term.

[Later:] I guess that should be “Schindler Transform”. Used in compression software from what I can tell from a quick internet search.

@njuffa,

Sorting Makes The World Go 'Round!

The flavor of sorting algorithm might change, but both databases and graphics pipelines can use key sorts.

Here’s a classic paper: A Sorting Classification of Parallel Rendering

The use case @BulatZ describes above sounds similar to how I use HotSort.

Packing interesting attributes into a 64-bit key (or larger) and sorting them into contiguous buckets/bins/spans for consumption by downstream kernels is a very performant and clean idiom.

One observation I’ve made is that if you can somehow control your sorting stage so that it remains within its performance “sweet spot” then the cost of sorting becomes negligible when compared to other kernel load balancing schemes.

For example, sorting 16K 64-bit keys can be completed on a single Maxwell multiprocessor in 83 usecs.

Only one gmem load and one gmem store per key. It’s nearly free. :)

I have never worked with databases, so that may explain my lack of exposure to fast sorting outside the classroom.

I have worked on the implementation of an OpenGL pipeline. I do not recall any sorting in that, though. Well, I guess there is implicit sorting by means of the Z-buffer, but that is specialized hardware, no software required. What kind of graphics operations benefit from fast sorting? Tile-based rendering or ray-tracing maybe?

Yes, TBDR pipelines sort/bin. Ray-tracing too but I haven’t looked at it too closely (zillions of experts out there could pipe in).

I’ve also been seeing sorting show up in SpGEMM GPU algorithms: [1] [2].

just a few examples:
http://www.eecs.berkeley.edu/~jshun/semisort.pdf

Thanks for the links. Looks like I was not far off with my guess about business analytics (which are typically based on in-memory databases).

@BulatZ,

This one too: GPU Multisplit