GTC 2020 S21572
Presenters: Andrey Adinets,NVIDIA
We’ll present a faster implementation of the least-significant-digit radix sort. Decoupled look-back is used to reduce the number of memory operations per partition pass from 3N to 2N. A faster partition implementation inside the thread block is used. For 32-bit keys, we use four partition passes, with 8 bits per digit. On V100 sorting 64 million random UInt32 keys, our implementation achieves the speed of 16 Gkey/s, which is more than 2x faster than cub::DeviceRadixSort.
Watch this session
Join in the conversation below.