K largest elements for vector of length N


In my application I would like to find the K largest elements for every vector of length N in a collection of vectors being the input:

[font=“Courier New”] float d_input[8][8][32][1024];[/font]

For every vector of length 1024 I would like to find the 64 largest values and their indices as output of the algorithm:

[font=“Courier New”] typedef struct {

    int index;

    float value;

} element_t;

element_t d_output[8][8][32][64];[/font]

As a starting point, I thought it could be a good idea to use 64 threads per vector in a block to work in parallel to find these values:

[codebox]dim3 grid_reduction(8*8, 32);

reduction_device<<< grid_reduction, 64 >>>( d_data, d_reduced );

__global__ void reduction_device( float *input, element_t *output )
{
    // one block per vector, 64 threads per block
    int vec = blockIdx.y * gridDim.x + blockIdx.x;
    input  += vec * 1024 + threadIdx.x;
    output += vec * 64 + threadIdx.x;   // note: only 64 outputs per vector

    __shared__ element_t vector[1024];

    // load vector from global memory, 16 elements per thread, stride 64
    for( int i = 0; i < 16; i++ )
    {
        vector[i*64 + threadIdx.x].index = i*64 + threadIdx.x;
        vector[i*64 + threadIdx.x].value = input[i*64];
    }
    __syncthreads();

    // search for 64 largest values

    // result is in first 64 locations
    *output = vector[threadIdx.x];
}[/codebox]


The big question now is the algorithm. I would appreciate some help, hints or pointers to papers.

Kind regards,


You probably need to do a radix sort on each vector and take the “top” 64 values from the output. An incomplete parallel reduction would be the obvious suggestion, but that won’t work in this case. There is a radix sort implementation in the CUDA 2.3 SDK, there is also an implementation in the thrust template library, although I doubt either will work without modification given the data structure you are using.


Thanks for the hint. From a first walk through the code, it looks quite complex and comprehensive; I’ll have to find the relevant parts, as I only want to use one block per vector. I hoped to find a single device function that covers the sorting of a relatively small vector with one block of threads. What is a good choice for the number of parallel threads per block working on the sort?

The bitonic sort example uses as many threads as there are elements. However, this would not be suitable for me because it exceeds the maximum number of threads per block (512).

Kind regards,

8*8*32 = 2048 blocks, each with 64 threads

All you need to do is scan 64 values at a time, compare them with the next 64 values, update the result (preferably in smem), and keep going like that until you reach 1024. Now each block will have the 64 largest values (although in non-sorted form). Sort them out on the CPU…

We are doing this for our LGA algo…

That approach cannot be guaranteed to yield the 64 maximum values in the input data.

OK. That’s what I thought I would like to do.

I am afraid I do not quite understand how this works. Would every thread scan all 1024 values, or would every thread scan only 1024/64=16 values?

At some point I thought that every thread could start by working on 1024/64 = 16 values and sorting these 16 values locally. Then, by some magic, all 64 sorted buckets would be merged.


For what you want to do, it can’t work. What sarnath has suggested is the latter (effectively an incomplete reduction). It will give you a subset containing the input data maximum, but not necessarily the set of 64 greatest values.

I don’t see how you can avoid doing a full sort of the input vector, whether that winds up being a sort of subsets (for example heapifying each subset) and then doing a merge, or a complete sort.

OK, then I’ll need to find a suitable method to sort the 1024 values using 64 or so threads.

I just discovered that using

__shared__ element_t vector[1024];

in the device function with

typedef struct {

int index;

float value;

} element_t;

seems to reduce the number of blocks being executed on one multiprocessor significantly in comparison to using only

__shared__ float vector[1024];

It looks like only half of them can be executed. If I am not mistaken, each multiprocessor of a Tesla C1060 has 16 kB of shared memory. One of my vectors consumes 8 kB, so only two blocks can be resident per MP at any time.


Another idea that you could try which avoids fully sorting the arrays:

  1. For each 1024-array: make it a max-heap
  2. For each heap, execute a remove-maximum operation 64 times

Creating the heaps can be done in parallel on the 1024-element arrays, either by copying first into shared memory or by working directly in global memory. For the second step, work directly in global memory and let each thread work on a different array. The second step does not need shared memory, so you can use a large number of threads to hide the global memory latency.

There’s a library for sorting in CUDPP and Thrust.


Yes… Each thread would scan 16 values… (not necessarily 16 consecutive ones. If you are looking at coalescing, each value would be separated by 64)

You are talking about the non-coalesced implementation… But that’s fine… Each of them will have a maximum and eventually you will have 64 maximum values… I don’t understand why Avidday is not accepting this theory. The only thing is that the 64 maximum values will NOT be in sorted order. You can do that fast on the CPU side…

There is no need to sort the 16 values locally in each thread… you just need to find the maximum among them… Now, combine all the maxima from each thread – you get the 64 top maxima in your array… That’s all. No buckets, nothing…

If there was a value in the rest of the 1024 that is greater than the 64 you have got by this method – it must have been the maximum in its bunch of 16 and hence it would find its way into this bunch of 64 – this proof is for avidday

Forget my previous post… It’s all junk, shallow thinking… Sorry if I have wasted your time.

Your idea was my first thought as well, but it is a bit subtle to realise that a local maximum in each of n subsets of the data isn’t necessarily an element of the set of n maximum values in the data set, and similarly that a given subset might contain more than one element of the set of n maximum values. Which is why an incomplete reduction without any form of sorting cannot be guaranteed to work in this case.

Absolutely, Avidday. Thanks for pointing it out. My stupidity and laziness astound me all the time…

I would say that having just the output array sorted/heaped is enough.
I would start by setting all output items to 0x80000000, the lowest negative int, or 0 if unsigned.
Then performing sorted insertions into the output, with a single “scan” of the input, would be enough.
I think this is going to perform better than fully sorting the input array. Am I missing something?


I don’t understand your algorithm. Heaping is the best way as I understand it… I don’t understand how you avoid comparisons… Can you elaborate?


I have seen the suggestions based on the idea of sorting the 1024-element input vectors and taking the first 64. IMHO this is overkill, since you are fully ordering all 1024 elements per vector while 960 of them get thrown in the basket. Furthermore, sorting algorithms usually have the constraint of working in place (I do not know about the referenced ones in particular) - and this is not the case here.

My idea can be elaborated in this way:

  • initialize “output” to 0x80000000 - that’s a heap containing all equal values. Actually, you can build the starting output heap out of the first 64 elements of “input”… That’s probably faster, and for sure clearer!

  • iterate over the remaining input[i]s elements:

— if input[i] belongs in the output heap (it is larger than the smallest), remove the smallest and insert input[i] in its place

— reorder the heap

The latter 2 operations can be combined in different ways for best optimization: probably letting the smallest element be the heap root speeds things up. So the heap will be sorted bottom-up, starting from the 64th element. The latter idea is about 30 seconds old - it came out just while writing this post.

However, this will take about 1024·log(64) operations, which is quite a bit better than 1024·log(1024) - better by exactly a factor of 6 to 10. That is the theory.