I want to find the max K elements from a array,K is less than 0.1% of the array size.
I know two possibilities:
(1) reduce for K times
(2) sort and get the top K

Anybody can tell me which one is better for CUDA? and are there any other efficient algorithms for this?

It depends on the size of K and N. If K is 1 or 2, then running reduction K times will probably win. There may even be a clever way to do two at once by a kind of double-elimination tournament ladder.

A full sort would be overkill for any K or N.

A good first-pass idea is to load/stream say 2000 values through a block and find the top K of all those values, all in shared memory so you do as little global memory access as possible.
Then all blocks merge their top K and you run one more pass to find the best K out of THOSE. This second pass would have a dramatically lower N so it’d be pretty fast. (Possibly you do this step on the CPU). In fact for K=1 you could use atomics to do the voting… cheap since it’s only one value per block in contention.

The general case for identifying the top K values in a list is surprisingly an O(N) algorithm. Basically you proceed like a quicksort. However after you partition into two halves, you don’t need to subpartition both halves… you only need to visit one. The other one will either be all too small (and can be discarded) or all large (and therefore all the values are in the top K and don’t need further discrimination.)