Has anyone created or attempted to create a CUDA selection algorithm (select ith largest element) or know anyone else who did? I’m considering creating a divide and conquer approach of choosing a random pivot, partition, and recurse on the partition where the ith largest will be.

```
Median(a, n, i) // select ith element from a (size n)
{
while (n > 1)
{
pivot = RandomPivot()
swap(a[pivot_index], a[n - 1])
left_size = CountLess <<<512>>> (a, pivot) // # elements <= pivot
right_size = n - left_size - 1
if (i < left_size)
{
CopyLess <<<512>>> (temp, a, n, pivot) // copy elements <= pivot to temp
swap(a, temp)
n = left_size
}
else if (i == left_size)
{
return pivot
}
else
{
CopyMore <<<512>>> (temp, a, n, pivot) // copy elements > pivot to temp
swap(a, temp)
i = i - (left_size + 1)
n = right_size
}
}
}
```

It seems a median algorithm isn’t easy and has all the challenges of fine grained || problems:

*frequent synchronization - is there anyway to avoid using multiple kernel launches? Needed due to lack of synchronization between thread blocks?

*Difficult to do an in-place partition on a, like we would for single processor? Thread divergence will be terrible.

*What’s best way to do CopyLess & CopyMore? They can all write to global result buffer, using atomics, but that probably would be slow. Or each thread can compute it’s own output, then merge them after knowing the sizes of the other threads’ result buffers.

On the surface, there seems to be an efficiency, ||ism trade-off (i.e. we can try to do clever things to reduce work, but also reduce ||ism, or we can do things less efficiently, but with more ||ism).