GPU_QuickSort an OpenCL implementation somewhere?

I need to get the median of a *global float, usually about 1000 in length. While I have started my own kernel to do just that, a generic sorting kernel could make this task a cake walk. I have tried many combinations of web searches to see if there was work already done in the public domain. There was a Cuda implementation of QuickSort, the divide and conquer method. It was done by 2 people at a Swedish university.

They published their results in this paper [post=“0”]GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors[/post] .

This document only shows pseudo-code. This implementation could be improved using atomics, which were not available at the time. I can tell right now that I could not do an OpenCL implementation, if all I had was this paper and source for a version of quicksort in Java. Does anyone know if Cuda can do recursion, because OpenCL cannot? Can anyone shed any light on OpenCL options?

I will continue on with my median kernel, which is vulnerable when there is a tie at the median. It is very parallel, but doing way too many compares. I need something in place, and then I can hopefully retire it someday.

There is no recursion in CUDA either. The hardware really isn’t made for it.

Good! That means whatever the Cuda implementation does is NOT just totally unsuitable for OpenCL.

Why don’t you just use the bitonic merge sort algorithm? It is fast and it can be implemented on the GPU pretty easily. Plus it can sort smaller arrays (I would say up to 1024 elements, maybe even more) in a single block using the shared memory (which is extremely fast)

Thanks, I’ll look at that. I found this [post=“0”]Link[/post] on it. It is good for me, because the source is in Java. Unless, you know of a site that discusses bitonic merge sort in the context of a gpu?

I think you can actually find a bitonic sort demo in the NVIDIA CUDA SDK (although if my memory servers me right, the demo is limited only on one block (one work-group) plus it is terribly ineffective).

You can also find some details about various sorting algorithms and their GPU implementation here:…rting-kider.pdf

(it is a pretty old paper so no CUDA or OpenCL… )

I’ve also implemented Bitonic sort some time ago and while it is definitely not the most optimal implementation it is probably better than the one in the NVIDIA’s SDK , but the code is an uncommented mess so I don’t know whether it would help you but if you want I can send it to you for a reference

P.S. Your links do not work … they always point to

Did you ever get the code for GPU-Quicksort? I need it as well.

No, sorry. Just the pseudo stuff from that paper. I looked, but all links just point back to the paper.

You will find an example of sort in the Particle example of the CUDA SDK. You will naturaly have to port it to OpenCL, but it’s not so painful :shifty: