ModernGPU: New fast sort and book-length online tutorial radix sort, sparse matrix, tons of docs

Hi all. I’ve been putting together a big book-length online GPU computing tutorial. It’s at

The content is very scan/reduction-centric, and the tutorials are built around some advanced case studies. I find that most GPU docs are pretty deficient in covering the very interesting and essential theory of inter-thread communication. Internalizing scan helps you tackle a much wider variety of problems without losing your mind in details.

The first case study is a radix sort I wrote:

The code sorts 1.31 billion 32-bit keys per second on a GTX 570. (!) It runs 30-40% faster than the current B40C development code (655 in the SVN). It’s 60% faster than thrust::sort for 32-bit keys. I don’t know of any pure radix sort that is faster. The code was written to be easily documented, and I cover pretty much every line from base principles, and show radix sort as an exercise in designing sophisticated hierarchical scans. It follows pretty closely the Satish/Harris/Garland radix sort. I note where my implementation deviates significantly from Satish, as that is a good reference. The code is relatively simple (compared to what is possible) and uses no templates . It’s a basic count-histogram-sort implementation with five kernels total (the histogram has kernels for upsweep, reduce, downsweep).

The second case study is a cleaned-up version of sparse matrix code I did months ago. The documentation is not ready but I did get some benchmarks up:

This code is extremely useful at illustrating segmented scan, the tool for balancing jagged work loads over many parallel processors. I’ll get detailed docs there soon.

I also have a blog ( ) if anyone would like to leave a comment. Additionally, I’ve got a ‘Mailbag’ section where I’ll make an article if you send in an interesting question. I’ve got one of these online already, which shows how to adapt the global scan operator to find the index of the maximum element in a large sequence.

The companion code to the tutorials is open-sourced with a BSD license:

The tutorial and code targets the Fermi architecture specifically.

I’m still adding in VS10 and GNU makefiles for some projects, but the really core stuff (like the sort library) should be ready to use. 64bit cubins are included in the repository, but 64bit users should take caution as this branch is not well tested.


Pretty cool stuff!!!
Thanks for making the code available.

Nice Content in your website. I like it.

This site kills all the other cuda books i’ve read. Awesome!

SeanB should sell it as a self published ebook. ;) It doesn’t prevent him from posting his content on his web site either.

Agreed on the ebook. The parts I’ve read so far are excellently written and fantastic to bridge the gap between “I’ve written some simple CUDA programs” and “I want to solve hard problems with CUDA”. If I ever get to teach a scientific computing course, I hope this will be available in book form so I can assign it to my students!

Thanks for the nice comments. I’ve still got a lot more content that needs to be posted. Maybe after Kepler and Southern Islands are released I can re-optimize the code and look at doing an ebook (by then I should have enough material). This stuff doesn’t have much shelf life and we’re already nearing the end of Fermi’s lifespan. The next gen cards would be a nice opportunity for more formal publishing :)


I posted this up on Hacker News this morning and it’s been sitting near the top of the front page since then, so I’d bet you’re getting a fair number of hits today! :thumbup:

Link to the post, for those interested: