bucket sort on GPU Any ideas?

The paper looks useful… i already found this other one.
Anyway i thought of just assigning some fixed memory amount to each triangle, but this is impossible because, in the worst case where one single triangle fills the screen, you have all 0-area triangles and one single one-million-elements bucket…
so any arbitrary number both in threads and memory won’t work, i think.

Luckily i don’t need to fill or sort these buckets (in that case the one-million thread would kill the performances anyway) but only to set their delimiters for each shared memory block for future use.
Then i spawn exactly one thread per bucket elements and i can use shared memory to lookup the thread index in the bucket.

Anyway also if O(n) the scan algorithm is pretty slow…

What I did is in serial code, but it’s very easy adapted to parallel, the main ideia is to actually know what should go in each bucket, you should make a way to make this automatically (of course) based on the amount of things (in my case numbers, in your case triangles) you want to divide and scan, then is just a mater of actually doing it. I remember I did something like this a very long time ago in MPI, the basic ideia (if I’m not mistaken) was:

  1. Create a global placeholder for all the buckets
  2. Create n amount of threads (n is arbitrary)
  3. Each thread would scan part of the whole number-vector-thingy and place it on the correct bucket (watching out for memory overwrite)
  4. When all threads were finished, you would have all your numbers-vector-thingy correcly placed in each bucket
  5. Then you sort.

Hope this helps, I haven’t had the time to make this in CUDA, but is a good exercise.

Exactly. Prefix scanning tasks are important for many parallel algorithms: scanning routines allow threads to cooperatively determine the appropriate destinations for their output items. Bucket- and distribution-based sorting methods are a perfect example: keys can be processed in a data-parallel fashion for a given bucket, but cooperation is needed among threads so that each may relocate its keys.

As was mentioned earlier, bucket/distribution-sorting can serve as a stages/passes within radix sorting methods. We make extensive use prefix-scan for bucketing/partitioning in our CUDA sorting routines here at UVA (which we believe to be the fastest available for any microarchitecture): Revisiting Sorting for GPGPU Architectures.

Update – we’ve ported our radix sorting implementation to Fermi (with stellar results on GTX480: 1B 32-bit keys/s) and have released it to the public as well. Here’s the new forum thread regarding our SRTS Radix sorting implementation (results, where to download, etc.):




Thanks for your notice.

I will check it.

check the answer here: https://stackoverflow.com/questions/16781995/efficient-bucket-sort-on-gpu/64848155#64848155