 # Median / Selection / Ith order statistic algorithm

Has anyone created or attempted to create a CUDA selection algorithm (select ith smallest element) or know anyone else who did? I’m considering creating a divide and conquer approach of choosing a random pivot, partition, and recurse on the partition where the ith largest will be.

``````Median(a, n, i)	 // select ith element from a (size n)

{

while (n > 1)

{

pivot = RandomPivot()

swap(a[pivot_index], a[n - 1])

left_size = CountLess <<<512>>> (a, pivot)	  // # elements <= pivot

right_size = n - left_size - 1

if (i < left_size)

{

CopyLess <<<512>>> (temp, a, n, pivot)	  // copy elements <= pivot to temp

swap(a, temp)

n = left_size

}

else if (i == left_size)

{

return pivot

}

else

{

CopyMore <<<512>>> (temp, a, n, pivot)	  // copy elements > pivot to temp

swap(a, temp)

i =  i - (left_size + 1)

n = right_size

}

}

}
``````

It seems a median algorithm isn’t easy and has all the challenges of fine grained || problems:

*frequent synchronization - is there anyway to avoid using multiple kernel launches? Needed due to lack of synchronization between thread blocks?

*Difficult to do an in-place partition on a, like we would for single processor? Thread divergence will be terrible.

*What’s best way to do CopyLess & CopyMore? They can all write to global result buffer, using atomics, but that probably would be slow. Or each thread can compute it’s own output, then merge them after knowing the sizes of the other threads’ result buffers.

On the surface, there seems to be an efficiency, ||ism trade-off (i.e. we can try to do clever things to reduce work, but also reduce ||ism, or we can do things less efficiently, but with more ||ism).

Take a look at CUDPP and Thrust. They both feature highly optimized sorting algorithms, which you can use to easily solve these problems. Thrust for the easiest way, CUDPP if you need to change some of the low-level stuff (though I don’t see why you would).

try looking at scan and vector collapse