I need to get the median of a *global float, usually about 1000 in length. While I have started my own kernel to do just that, a generic sorting kernel could make this task a cake walk. I have tried many combinations of web searches to see if there was work already done in the public domain. There was a Cuda implementation of QuickSort, the divide and conquer method. It was done by 2 people at a Swedish university.

They published their results in this paper [post=“0”]GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors[/post] .

This document only shows pseudo-code. This implementation could be improved using atomics, which were not available at the time. I can tell right now that I could not do an OpenCL implementation, if all I had was this paper and source for a version of quicksort in Java. Does anyone know if Cuda can do recursion, because OpenCL cannot? Can anyone shed any light on OpenCL options?

I will continue on with my median kernel, which is vulnerable when there is a tie at the median. It is very parallel, but doing way too many compares. I need something in place, and then I can hopefully retire it someday.