Doing multiple sorts using CUDA

Hi all, I’m an Excel VBA user.
The problem I need to solve is that I have more than 10,000 sets of data, each containing around 10 records, and I need to sort the records within each set.
Can I use CUDA to speed this up by running the 10,000 sorts in parallel?
At the moment I have to do them one after another in for loops in Excel…

Thanks all,

What you’re describing is an “embarrassingly parallel” problem that will run well on CUDA or pretty much any other parallel platform.

Have each thread load a set of elements, sort them, and then store them back. No inter-thread communication is required.

Note that 10 elements can be sorted in only 29 “compare-exchange” operations.
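
As a rough illustration of the "one thread per set" idea, here is a minimal C++ sketch of the work a single thread would do. It assumes every set holds exactly 10 floats stored back to back, and all names in it (kSetSize, compare_exchange, sort_one_set, sort_all_sets) are made up for this example. To keep it easy to check it uses plain odd-even transposition, which needs 45 compare-exchanges for 10 elements, not the 29-operation network mentioned above. The outer loop only stands in for the thread index; in a CUDA kernel each thread would presumably call something like sort_one_set on its own set.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kSetSize = 10;  // assumption: exactly 10 floats per set

// Swap a and b if they are out of order, so that a <= b afterwards.
inline void compare_exchange(float& a, float& b) {
    if (b < a) std::swap(a, b);
}

// Sort one set of kSetSize floats in place with odd-even transposition.
// This is the work a single thread would do for its own set.
void sort_one_set(float* set) {
    for (std::size_t round = 0; round < kSetSize; ++round) {
        // Even rounds compare pairs (0,1),(2,3),...; odd rounds (1,2),(3,4),...
        for (std::size_t i = round % 2; i + 1 < kSetSize; i += 2) {
            compare_exchange(set[i], set[i + 1]);
        }
    }
}

// Sort every set stored back to back in `data`.
// Trailing elements that do not fill a whole set are left untouched.
void sort_all_sets(std::vector<float>& data) {
    const std::size_t num_sets = data.size() / kSetSize;
    for (std::size_t s = 0; s < num_sets; ++s) {  // s stands in for the thread index
        sort_one_set(&data[s * kSetSize]);
    }
}
```

The sequence of comparisons never depends on the data, so every thread would follow the same instruction path, which is what makes this style of sort a good fit for a GPU.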

As an example, if you had a GTX Titan this would be very few instructions per core (assuming your elements are 32 bits each):

(~10,000 threads / 2688 cores) * (10 loads + 29 compare-exchanges + 10 stores) < 200 ops per core
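
The same back-of-the-envelope estimate can be written as a tiny C++ helper, in case you want to plug in other core counts. The function name ops_per_core is made up for this sketch, and it assumes the threads spread evenly over the cores; it ignores memory latency and scheduling, so treat the result as an order of magnitude only.

```cpp
// Rough per-core operation count: threads are spread evenly over the cores,
// and each thread performs ops_per_thread operations.
double ops_per_core(double threads, double cores, double ops_per_thread) {
    return threads / cores * ops_per_thread;
}
```

With the numbers above, ops_per_core(10000, 2688, 10 + 29 + 10) comes out at roughly 182, which is where the "< 200 ops" figure comes from.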

This is helpful, thanks, Allanmac!
The elements are single precision floats.
However, I’m only using a Stone Age GTS 250 with just 128 cores…
Anyway, I need to do this repeatedly, and I’m pretty sure it’ll be much faster than running it in Excel…
Thanks again.
Rgds, Ed

Your 128-core GTS 250 will still be incredibly fast! A CPU would also be blazingly fast as it’s really not that many operations. :)

My bet is that exchanging the data with the Excel sheet will take on the order of tens of seconds, whereas the actual CUDA computation will be done in less than a second.

Thanks all. It will shorten the computation time from 10 months to 3 days and make my project feasible again.
But I need to pick up C again, which I put down 10 years ago…

Hi all,
If I were to build a PC for the above project, which is embarrassingly parallel, should I go for:

  1. a Titan or
  2. 2 x 660 Ti?
    I found that 2 x 660 Ti have a similar number of cores but cost only half as much as a Titan…


If all you’re doing is sorting 10^4+ buckets of ~10 elements then I doubt you need a GPU of that caliber.

For example, a dusty old GT 240 (GT215 w/96 cores) can sort 10,000+ buckets of 1024 32-bit keys at over 1000 Mkeys/sec. This means sorting 32K buckets of 1K keys takes about 32 milliseconds.

A GTX 680 is ~10x faster (~3.25 ms.).

Your described problem is far simpler than this, though, as no merging is required. I can’t estimate the performance, but it will be silly fast. Sorting 10 or 20 elements just isn’t that much work, and you’ll probably be able to run at some high percentage of device bandwidth (GTS 250 = 70 GB/sec.) since your CUDA kernel will basically resemble a memcpy()… which really means you will be running at PCIe bus speed if you’re round-tripping data between the CPU and GPU.
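
To put a rough number on the bandwidth argument, here is a small C++ sketch. It assumes 10,000 sets of 10 four-byte floats, with each element read once and written once, and it takes the 70 GB/sec. device figure quoted above at face value. The helper names bytes_moved and transfer_seconds are made up for this example, and it says nothing about the PCIe transfer, which depends on your system.

```cpp
// Bytes moved when every element is read once and written once.
double bytes_moved(double num_sets, double set_size, double bytes_per_element) {
    return 2.0 * num_sets * set_size * bytes_per_element;
}

// Time in seconds to move `bytes` at `bandwidth_gb_per_s` (1 GB = 1e9 bytes).
double transfer_seconds(double bytes, double bandwidth_gb_per_s) {
    return bytes / (bandwidth_gb_per_s * 1e9);
}
```

bytes_moved(10000, 10, 4) gives 800,000 bytes, and transfer_seconds of that at 70 GB/sec. is on the order of 11 microseconds, so at peak device bandwidth one pass over the data would be close to free.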

Unless you have some sort of hard real-time requirement, I would skip thinking about TITANs until it works on your GTS 250 or a regular CPU. :)

Thanks Allanmac,
I oversimplified what I need to do. Actually, I need to run this kind of sorting several million times, plus some simple matrix multiplication in each iteration.
And it’s an ongoing project that I may need to run twice a week, which is why I’m thinking of getting a new GPU…
Rgds, Ed