I ran into some major register-spilling issues with CUDA 7.x in my HotSort sorting library.
CUDA 7.5 didn’t resolve the spills and I couldn’t wait any longer… it was unfortunately time for a heavy rewrite. :(
Some of the work was done months ago, but it took the last few weeks to finish the higher-level kernels.
I also managed to generalize the implementation so it can run on architectures that aren’t as programmer-friendly as CUDA multiprocessors.
The generalized version now compiles cleanly on CUDA 7.x, and its performance seems to match the original algorithm.
Here’s a snapshot showing unsigned 64-bit key throughput:
Gotta love those Maxwell SMMs. :)