I’m not aware of any convenient library implementations (which is what I would want for this) that do this. If you do a bit of google searching you will probably conclude that for very small N, a method of using successive max-finding reductions may be quicker than sorting. For large values of N its doubtful that you can do much better than sorting. There may be some no-mans-land in between where there might be some benefit to a custom implementation, but I’m not aware of any. a google search on “top N sort CUDA” returns some interesting material, though.
Fast sorting is a fairly difficult problem to code on the GPU. However its quite important so there are very good implementations available. As a result of this, the no-mans-land window is effectively smaller on the large-N side.
The fastest algorithm I know of that is robustly implemented on GPUs and has survived a lot of use and QA is radix sort. Unfortunately, from what I know of it, radix sort does not readily lend itself to early exit for this, as it sorts from least significant bit to most significant bit. So a sorting algorithm with an early exit might have to start out with an “inferior” algorithm, further crippling early-exit benefits.