Efficiently finding the N largest items within much bigger vectors

I have lots of vectors with 20M floats each, already on the GPU. How do I efficiently find the indexes of the N biggest values in a long vector, when N is much smaller than the size of the vector? For example, finding the indexes of the 2000 largest values within a vector of 20M.

Fully sorting those 20M-item vectors seems wasteful(?), likely in execution time and certainly in the storage needed for the result, since I only need about 1/10,000 of the indexes.
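On the CPU, NumPy's `argpartition` does this kind of partial selection in roughly O(M) instead of a full O(M log M) sort; a sketch of the kind of operation I'm after (a CPU analogue only, not a GPU solution):

```python
import numpy as np

def top_n_indices(vec, n):
    """Return indexes of the n largest values, without fully sorting vec."""
    # argpartition gathers the n largest (in arbitrary order) into the
    # last n slots in O(M) time
    idx = np.argpartition(vec, -n)[-n:]
    # optionally order just those n by value: an O(n log n) cleanup
    return idx[np.argsort(vec[idx])[::-1]]

rng = np.random.default_rng(0)
v = rng.standard_normal(20_000_000).astype(np.float32)
top = top_n_indices(v, 2000)  # indexes of the 2000 largest, descending by value
```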

I’m not aware of any convenient library implementations (which is what I would want for this) that do this. If you do a bit of Google searching you will probably conclude that for very small N, a method of successive max-finding reductions may be quicker than sorting. For large values of N it’s doubtful that you can do much better than sorting. There may be some no-man’s-land in between where a custom implementation could offer some benefit, but I’m not aware of any. A Google search on “top N sort CUDA” returns some interesting material, though.
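The successive max-reduction idea is just: find the max, mask it out, repeat N times. On a GPU each pass would be one parallel argmax reduction kernel, so the total cost is roughly N reductions over M elements. A CPU sketch in NumPy (illustrative only; the names here are my own, not from any library):

```python
import numpy as np

def top_n_by_repeated_max(vec, n):
    """Top-n indexes via n successive max reductions; sensible when n << len(vec)."""
    v = vec.astype(np.float32, copy=True)   # work on a copy so we can mask winners
    out = np.empty(n, dtype=np.int64)
    for i in range(n):
        j = int(np.argmax(v))   # one full max reduction per extracted item
        out[i] = j
        v[j] = -np.inf          # mask it out so the next pass finds the runner-up
    return out
```

For N = 2000 that is 2000 full passes over 20M elements, which is why this only wins for much smaller N.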

Fast sorting is a fairly difficult problem to code on the GPU. However, it’s quite important, so very good implementations are available. As a result, the no-man’s-land window is effectively smaller on the large-N side.
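Concretely, the large-N baseline is a full key-value sort (e.g. `thrust::sort_by_key` with the indexes as values on the GPU), after which you just slice off the last N entries. A CPU stand-in with NumPy, to show the shape of the approach:

```python
import numpy as np

def top_n_by_full_sort(vec, n):
    """Full sort, then slice: the large-N baseline."""
    order = np.argsort(vec, kind="stable")  # stand-in for a GPU key-value sort
    return order[-n:][::-1]                 # indexes of the n largest, descending
```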

The fastest algorithm I know of that is robustly implemented on GPUs and has survived a lot of use and QA is radix sort. Unfortunately, from what I know of it, radix sort does not readily lend itself to an early exit here, because it sorts from the least significant bit to the most significant bit. So a sorting algorithm with an early exit might have to start out with an “inferior” algorithm, further crippling the early-exit benefits.
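To see why: a least-significant-digit (LSD) radix sort only establishes the final order on its last pass, over the most significant digits, so stopping early leaves the largest keys scattered throughout the array. A toy CPU sketch of the LSD pass structure (each counting-sort pass replaced by a stable `argsort` for brevity):

```python
import numpy as np

def lsd_radix_sort(keys, bits=32, radix_bits=8):
    """LSD radix sort on unsigned 32-bit keys: least significant digit first."""
    a = np.asarray(keys, dtype=np.uint32).copy()
    for shift in range(0, bits, radix_bits):
        digit = (a >> shift) & ((1 << radix_bits) - 1)
        # stable pass on the current digit; stability preserves the order
        # established by all less-significant passes
        a = a[np.argsort(digit, kind="stable")]
    return a  # fully sorted only after the final, most-significant pass
```

Skipping the last pass would leave elements ordered only by their low bits, which says nothing about which keys are largest, so there is no cheap early exit for a top-N query.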
