Hi all. I’ve been putting together a big book-length online GPU computing tutorial. It’s at http://www.moderngpu.com/
The content is very scan/reduction-centric, and the tutorials are built around some advanced case studies. I find that most GPU docs are pretty deficient in covering the very interesting and essential theory of inter-thread communication. Internalizing scan helps you tackle a much wider variety of problems without losing your mind in details.
The first case study is a radix sort I wrote:
http://www.moderngpu.com/sort/mgpusort.html
The code sorts 1.31 billion 32-bit keys per second on a GTX 570. (!) It runs 30-40% faster than the current B40C development code (655 in the SVN). It’s 60% faster than thrust::sort for 32-bit keys. I don’t know of any pure radix sort that is faster. The code was written to be easily documented, and I cover pretty much every line from base principles, and show radix sort as an exercise in designing sophisticated hierarchical scans. It follows pretty closely the Satish/Harris/Garland radix sort. I note where my implementation deviates significantly from Satish, as that is a good reference. The code is relatively simple (compared to what is possible) and uses no templates . It’s a basic count-histogram-sort implementation with five kernels total (the histogram has kernels for upsweep, reduce, downsweep).
The second case study is a cleaned-up version of sparse matrix code I did months ago. The documentation is not ready but I did get some benchmarks up:
http://www.moderngpu.com/sparse/mgpusparse.html
This code is extremely useful at illustrating segmented scan, the tool for balancing jagged work loads over many parallel processors. I’ll get detailed docs there soon.
I also have a blog ( http://www.moderngpu.com/blog/ ) if anyone would like to leave a comment. Additionally, I’ve got a ‘Mailbag’ section where I’ll make an article if you send in an interesting question. I’ve got one of these online already, which shows how to adapt the global scan operator to find the index of the maximum element in a large sequence.
The companion code to the tutorials is open-sourced with a BSD license:
The tutorial and code targets the Fermi architecture specifically.
I’m still adding in VS10 and GNU makefiles for some projects, but the really core stuff (like the sort library) should be ready to use. 64bit cubins are included in the repository, but 64bit users should take caution as this branch is not well tested.
Thanks,
sean