Partial sort an array in CUDA or Thrust?

I am working on a problem that looks like this:

typedef pair<float,int> pt;
int N = 5000; // matrix size NxN
// element at (row,col) = (i,j) is at position i * N + j
vector<pt> data;
// data.size() = N * (N- 1) / 2, representing all pairs of objects
sort(data.begin(), data.end());

// pair objects greedily
for (int i = 0; i < data.size(); i++)
{
    int x = data[i].second / N, y = data[i].second % N;
    if (selected[x] || selected[y]) continue;

    selected[x] = true;
    selected[y] = true;
    // ...
}

I found that in my data, the first 5% of the array contribute to ~98% of the pairs. So the best method is to partially sort and find the first 5%, then find the remaining unpaired objects and work on them separately.

So, is there any library that support partial sort_by_key in CUDA? thrust::sort_by_key always sort the whole array, which is not needed in my case.

An alternative is to find K-th smallest element (like std::nth_element), is there a similar function in any GPU library?

Thanks!

Edit: actually I don’t even need to find K largest elements, I only need to find the K-th largest element. Because I will process the rest on CPU anyway

You could sort chunks of size k and merge them in parallel, always keeping only the k smallest elements of the resulting range.

I’ve tried your suggestion and it’s actually slower.

int K = N / 10;
for (int t = 0; t < 10; t++) {
    int starter  = t * K;
    int ender = min(starter + K, N);            
    thrust::sort(thrust::device, d_ptr + starter, d_ptr + ender);
}

With N = 25 * 10^6, d_ptr* as (equivalent of) std::pair<float,int>, sorting the whole array takes 15ms. While sorting by chunks alone takes ~20ms, so I didn’t add merging them.