GTC 2020: A Faster Radix Sort Implementation

GTC 2020 S21572
Presenters: Andrey Adinets,NVIDIA
Abstract
We’ll present a faster implementation of the least-significant-digit radix sort. Decoupled look-back is used to reduce the number of memory operations per partition pass from 3N to 2N. A faster partition implementation inside the thread block is used. For 32-bit keys, we use four partition passes, with 8 bits per digit. On V100 sorting 64 million random UInt32 keys, our implementation achieves the speed of 16 Gkey/s, which is more than 2x faster than cub::DeviceRadixSort.

Watch this session
Join in the conversation below.

Nice content, thanks for sharing! I’m thinking about implementing this in compute shader,
But I found this presentation a little bit confusing…, is there any source code or paper about this algorithm?