Hi all,

We’d like to share with everyone our radix sorting implementation, SRTS Radix Sort. You can find the code, information, and performance results for a variety of GPUs and input types/sizes on our Google Code project site, Back40Computing.

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our GPU sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the Giga-keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second). Our results demonstrate a range of 2x-4x speedup over the current Thrust and CUDPP sorting implementations, and we operate on keys of any C/C++ numeric type. Satellite values are optional, and can be any arbitrary payload structure (within reason).

The radix sorting method currently seems to be the most amenable for parallelization on high-bandwith architectures, and the fastest sorting implementations for numeric keys on modern processor architectures are radix sorting designs. In the classic apples-to-oranges comparison, the state-of-the-art sorting rates are (at the moment):

Intel radix sort on quad-core Core i7: **240M 32-bit keys / second** (paper)

Intel radix sort on 32-core Knight’s Ferry MIC (Larrabee successor): **560M 32-bit keys / sec** (paper)

Our SRTS radix sort on GTX480: **1,005M 32-bit keys/sec**

And, unlike the former two, you can walk in off the street and buy a GTX480 and download our source code and evaluate them yourself.

In addition to performance, one of our design goals for this project is flexibility. We’ve designed our implementation to adapt itself and perform well on all generations and configurations of programmable NVIDIA GPUs, and for a wide variety of input types. See the project site for a (growing) compilation of performance results on a variety of devices and input types.

Although it’s slightly outdated since our refactoring to accommodate the Fermi architecture, we request that if you use/reference/benchmark this code, please cite our Technical Report:

```
@TechReport{ Merrill:Sorting:2010,
author = "Duane Merrill and Andrew Grimshaw",
title = "Revisiting Sorting for GPGPU Stream Architectures",
year = "2010",
institution = "University of Virginia, Department of Computer Science",
address = "Charlottesville, VA, USA",
number = "CS2010-03"
}
```

Cheers!

Duane