Fastest sorting algorithm on GPU currently

Hello community,

I understand that sorting is a primitive algorithm on GPU. There are many different implementations, e.g., radix sorting,
merge sorting, etc.

With many different sorting algorithm, I am not quite sure which one does the best performance. My question is what is
the fastest sorting algorithm on GPU currently.

Any pointer would be appreciated.

Thank you.

If you’re sorting 32-bit or 64-bit keys there are benchmarks for both free and non-free implementations:

1 Like

thrust::sort() using device pointers is very fast for large arrays of primitive types (radix sort). About 100x faster than 1 single threaded 4.5 GHz using stl:sort() for an array of 32-bit floats with 2^27 elements.

currently thrust just incorporates cub radix sort, providing simpler interface. simpler in both means. so if you mad for speed, you can use moderngpu for merge sort, cub for radix sort. commercial libs may be faster - otherwise you will hardly buy it :)

Thank you for the above comments. I will benchmark those sorting algorithms and see which one
does the best performance in various scenarios.

i had published benchmarking program for CUB: https://github.com/Bulat-Ziganshin/Compression-Research/tree/master/app_radix_sort

And here’s a Thrust sort snippet: https://gist.github.com/allanmac/ce559b9b2357a29661b2

I have found that the choice of GPU and the configuration make quite a big difference in performance, at least it does when using the thrust library.

For a test case of an randomly generated array of 268,435,456 elements (32-bit float) using CUDA 8 Windows 7 x64;

All times include the memory copies from host to device (pinned), the calling of sort, and the copy back of the result from device to host.

In this case the total memory copy time is about 171 ms for the total memory operations, so any time above that is the total sorting time on 1 GPU.

For a GTX Titan X using TCC driver;

Num Elements= 268435456
GPU timing: 0.286 seconds.
CPU timing: 26.385 seconds.

Error= 0

Max Difference= 0

So the total sorting time GTX Titan X using thrust::sort() using pointers is (286-171)= 115 ms for the sort.

Now if I use the GTX 1080 which is connected to the display and uses the WDDM driver;

Num Elements= 268435456
GPU timing: 1.241 seconds.
CPU timing: 26.355 seconds.

Error= 0

Max Difference= 0

So the total sorting time GTX 1080 using thrust::sort() using pointers is (1241-171)= 1070 ms for the sort.

So using the Titan X is about 9.3 times faster for a large sort than using the GTX 1080, which is disappointing but illustrates how the system configuration can make a huge difference.

Source code (not elegant but you will get the idea);
http://pastebin.com/s30yg2NS

Just ran that code on my laptop using a GTX 980M without any memory clock boost and somehow the GTX 980M is much faster than the GTX 1080 even though both are connected to the display and use the WDDM driver;

Num Elements= 268435456
GPU timing: 0.483 seconds.
CPU timing: 33.751 seconds.

Error= 0

Max Difference= 0

So the GTX 980M is over 2x faster than the GTX 1080? Weird…

sorting libs are not yet optimized for pascal gpus, so they may take wrong computation paths. for example optimize for sm>=5.3 as for sm 5.3 (low-end mobile gpu). you may report that to developers and i’m sure that they will make 1080 faster than any old titans. one problem is what thrust just employs cub code for radix sort, and may be not the lastest version. so i suggest you to install cub and go benchmark/bugreport it

I think I will test via your code next, thanks.

I thought that maybe the performance difference between the Pascal GTX 1080 and the Maxwell GTX Titan X for that large 1GB array may be related to the memory bandwidth issue (the bug I filed which NIVIDA fixed then ‘unfixed’).

But even if I reduce the size of the array by half (staying well below the 1 GB level) there still is a huge difference in performance.

All recorded time include host-device and device-host copies in addition to the device sorting time. For this 2^27 size looking at the results from profiling in nvprof the total memory copy overhead is ~84 ms for either GPU.

GTX Titan X;

Num Elements= 134217728
GPU timing: 0.143 seconds.
CPU timing: 12.817 seconds.

Error= 0

Max Difference= 0

GTX 1080;

Num Elements= 134217728
GPU timing: 0.595 seconds.
CPU timing: 12.786 seconds.

Error= 0

Max Difference= 0

Thrust does indeed seem to be calling cub radix sort under the hood.

Profiling data for GTX Titan X;

GPU timing: 0.143 seconds.
CPU timing: 12.835 seconds.

Error= 0

Max Difference= 0
==2816== Profiling application: ConsoleApplication1.exe
==2816== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
8.54170s  44.029ms                    -               -         -         -         -  512.00MB  11.356GB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
8.58613s  1.8074ms            (600 1 1)        (64 1 1)        96  2.0625KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullTyp
e*, int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [230]
8.58795s  12.192us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [235]
8.58796s  4.8849ms            (600 1 1)       (128 1 1)        90  5.5781KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*,
 thrust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy*, int*, int, int, bool, bool, thr
ust::system::cuda::detail::cub_::GridEvenShare<int*>) [250]
8.59285s  1.8032ms            (600 1 1)        (64 1 1)        96  2.0625KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullTyp
e*, int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [261]
8.59465s  12.257us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [266]
8.59467s  4.8628ms            (600 1 1)       (128 1 1)        90  5.5781KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*,
 thrust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy*, int*, int, int, bool, bool, thr
ust::system::cuda::detail::cub_::GridEvenShare<int*>) [281]
8.59953s  1.8049ms            (600 1 1)        (64 1 1)        96  2.0625KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullTyp
e*, int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [292]
8.60134s  12.160us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [297]
8.60136s  4.8709ms            (600 1 1)       (128 1 1)        90  5.5781KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*,
 thrust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy*, int*, int, int, bool, bool, thr
ust::system::cuda::detail::cub_::GridEvenShare<int*>) [312]
8.60623s  1.8014ms            (960 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [324]
8.60803s  29.889us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [329]
8.60806s  7.1746ms            (960 1 1)        (64 1 1)       127  4.6719KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [344]
8.61524s  1.8035ms            (960 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [355]
8.61705s  29.378us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [360]
8.61708s  7.1182ms            (960 1 1)        (64 1 1)       127  4.6719KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [375]
8.62420s  1.7974ms            (960 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [386]
8.62600s  29.601us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [391]
8.62603s  6.7676ms            (960 1 1)        (64 1 1)       127  4.6719KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [406]
8.63280s  1.8026ms            (960 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [417]
8.63461s  29.633us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [422]
8.63464s  4.9317ms            (960 1 1)        (64 1 1)       127  4.6719KB        0B         -           -  GeForce GTX TIT         1         7  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [437]
8.63957s  3.8688ms                    -               -         -         -         -  512.00MB  129.24GB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoD]
8.64367s  40.772ms                    -               -         -         -         -  512.00MB  12.263GB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

Profiling output for GTX 1080;

GPU timing: 0.599 seconds.
CPU timing: 12.786 seconds.

Error= 0

Max Difference= 0
==5320== Profiling application: ConsoleApplication1.exe
==5320== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
10.3883s  44.699ms                    -               -         -         -         -  512.00MB  11.186GB/s  GeForce GTX 108         2        14  [CUDA memcpy HtoD]
10.8542s  2.1894ms            (500 1 1)        (64 1 1)        96  2.0625KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullTyp
e*, int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [230]
10.8564s  6.1440us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [235]
10.8564s  5.7950ms            (500 1 1)       (128 1 1)        87  5.5781KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*,
 thrust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy*, int*, int, int, bool, bool, thr
ust::system::cuda::detail::cub_::GridEvenShare<int*>) [250]
10.8622s  2.1679ms            (500 1 1)        (64 1 1)        96  2.0625KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullTyp
e*, int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [261]
10.8644s  6.1440us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [266]
10.8644s  5.8851ms            (500 1 1)       (128 1 1)        87  5.5781KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*,
 thrust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy*, int*, int, int, bool, bool, thr
ust::system::cuda::detail::cub_::GridEvenShare<int*>) [281]
10.8703s  2.1832ms            (500 1 1)        (64 1 1)        96  2.0625KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullTyp
e*, int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [292]
10.8724s  6.1450us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [297]
10.8725s  5.9036ms            (500 1 1)       (128 1 1)        87  5.5781KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*,
 thrust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxAltDownsweepPolicy*, int*, int, int, bool, bool, thr
ust::system::cuda::detail::cub_::GridEvenShare<int*>) [312]
10.8784s  2.1740ms            (800 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [324]
10.8805s  18.432us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [329]
10.8806s  9.3638ms            (800 1 1)        (64 1 1)       126  4.6719KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [344]
10.8899s  2.1730ms            (800 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [355]
10.8921s  18.433us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [360]
10.8921s  9.3013ms            (800 1 1)        (64 1 1)       126  4.6719KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [375]
10.9014s  2.1873ms            (800 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [386]
10.9036s  19.457us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [391]
10.9036s  8.6142ms            (800 1 1)        (64 1 1)       126  4.6719KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [406]
10.9122s  2.1720ms            (800 1 1)        (64 1 1)        72  4.1250KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortUpsweepKernel<thrust::system::cuda::de
tail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxUpsweepPolicy, bool=0, float, int>(thrust::system::cuda::detail::cub_::NullType*, int*, thrust::system::cuda::detail::cub_::NullType*,
 int, int, bool, thrust::system::cuda::detail::cub_::GridEvenShare<thrust::system::cuda::detail::cub_::NullType*>) [417]
10.9144s  19.456us              (1 1 1)      (1024 1 1)        50      176B        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::RadixSortScanBinsKernel<thrust::system::cuda::detail:
:cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxScanPolicy, int>(float*, int) [422]
10.9144s  5.5697ms            (800 1 1)        (64 1 1)       126  4.6719KB        0B         -           -  GeForce GTX 108         2        14  void thrust::system::cuda::detail::cub_::DeviceRadixSortDownsweepKernel<thrust::system::cuda::
detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy, bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>(thrust::system::cuda::detail::cub_::NullType*, th
rust::system::cuda::detail::cub_::NullType, int*, int, thrust::system::cuda::detail::cub_::DeviceRadixSortDispatch<bool=0, float, thrust::system::cuda::detail::cub_::NullType, int>::PtxDownsweepPolicy*, int*, int, int, bool, bool, thrust::s
ystem::cuda::detail::cub_::GridEvenShare<int*>) [437]
10.9201s  4.5242ms                    -               -         -         -         -  512.00MB  110.52GB/s  GeForce GTX 108         2        14  [CUDA memcpy DtoD]
10.9441s  42.670ms                    -               -         -         -         -  512.00MB  11.718GB/s  GeForce GTX 108         2        14  [CUDA memcpy DtoH]

If you look in detail to the profiling output you can see that the launch configurations are different for the various steps of the sort. This supports Bulat’s theory that maybe thrust (or CUB) has not been optimized for Pascal.

Wait a second I see that issue is that there is a large unexplained gap which makes up most the performance difference for the GTX 1080.

Take a look at lines 10-11 in each profiling output. For the GTX 1080 there is a 460 ms gap between the first host-device copy and the start of the sort, while for the GTX Titan X the gap is only as long as the memory copy.

The GTX 980M also does not have this gap, so it seems unlikely this is a WDDM issue alone.

Any idea of what is causing this issue?

When you run the application on the GTX 1080, is it compiled for the GTX 1080 architecture?

Yes, building against CUDA 8.0RC, Visual Studio 2012 with the following code generation;

compute_52,sm_52;compute_61,sm_61;

Using the verbose compile output and I can see it compiles for both sm_52 and sm_61.

Also using the most recent driver as well.

take into account that radix sort use twice the memory for sort, it is hidden in thrust API

and run first at least two sorting operations, measure only last one to avoid counting any startup delays

@LongY

Performance can vary greatly if its for a specific use case.

For example, there are implementations of “unstable” sorts for lower precision (ex uint16) data that is several times faster than thrust.

1 Like

Ok, I updated the test to average over 8 iterations (host-device copy of original array, sort, device-host copy of result) and it does appear that the GTX 1080 does have some extra ‘initialization’ time.

Changed to this;

const int num_iter=8;
	cout<<"\nnumber of sorting iterations= "<<num_iter<<'\n';

	wTimerRes = 0;
	init = InitMMTimer(wTimerRes);
	startTime = timeGetTime();

	for(int i=0;i<num_iter;i++){
	
		err=cudaMemcpy(D_Arr,H_Arr,num_bytes,_HTD);
		if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

		thrust::device_ptr<float> D_p=thrust::device_pointer_cast(D_Arr);

		thrust::sort(D_p,D_p+num_elem);

		err=cudaMemcpy(H_gpu_result,D_Arr,num_bytes,_DTH);
		if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

		err=cudaDeviceSynchronize();
		if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}
	}
	

	endTime = timeGetTime();
	GPU_time=endTime-startTime;
	cout<<"GPU timing average: "<<(double(GPU_time)/1000.0)/double(num_iter)<<" seconds including all memory copies both directions.\n";

GTX Titan X;

Num Elements= 134217728

Using device number 0 which is a GeForce GTX TITAN X

number of sorting iterations= 8
GPU timing average: 0.143375 seconds including all memory copies both directions.
CPU timing: 12.798 seconds.

Error= 0

Max Difference= 0

GTX 1080;

Num Elements= 134217728

Using device number 1 which is a GeForce GTX 1080

number of sorting iterations= 8
GPU timing average: 0.254875 seconds including all memory copies both directions.
CPU timing: 12.817 seconds.

Error= 0

Max Difference= 0

since each time the memory copies take about 84 ms then the GTX Titan X average sorting time for this size is (143-84)= 59 ms while the GTX 1080 is (254-84)= 170 ms.

While the GTX 1080 time on average is faster than the original times I posted the GTX Titan X is still over 2x faster for the sort.

What is weird is that the GTX 980M does not have that initial 460 ms of overhead on the first call even though it is also connected to the display, but the GTX 1080 does consistently have that large overhead.

I do get that the average time for the GTX 1080 is being influenced by the ‘first’ call, I just do not understand why Maxwell GPUs do not have that overhead.

This paper compared three sorting algorithm (Thurst, CUB, and MGPU). It is published this year.
http://arxiv.org/pdf/1601.03144v1.pdf

I am also curious why there is an extra 460ms of overhead for the GTX 1080.

the more correct approach will be to not average all 8 runs, but drop first run time at all. or measure it separately

I would be in need of a such a high-performance uint16 sort. Can you point me to a library or paper?