find the top Nth biggest of the elements

The data structure is two array: key and value.
How fast can cuda sort be?
What is the fast implementation?
Compared to qsort in C, how much advantage can it have?
thank you.

harbor

It depends what kinds of keys you’re sorting (and the values as well). If you’re sorting integer or float keys, the performance is excellent if you use Thrust or CUDPP (many times faster than CPU). If you’re sorting some custom types, the performance will be somewhat reduced because the libraries will switch over to a less efficient, more general algorithm.

How about radixsort in smokeparticle?thank you

Check out the Thrust and CUDPP libraries. Thrust implements both RadixSort and MergeSort (used for general data types), while CUDPP just implements the RadixSort.

Also, read this paper:

http://mgarland.org/files/papers/gpusort-ipdps09.pdf

thaks for the reply.

I used the thrust for test. It took 0.05s to sort about 18k float number while 0.005s when use qsort in C.

The GPU only begins to out pace CPUs in sort when the item count gets into the hundreds of thousands or millions. Your problem is probably two orders of magnitude too small for the GPU.

Are you including the time to transfer the data to/from the GPU? Excluding transfers I sorted 18k random floats in 0.000791329 seconds with Thrust on a GeForce GTX 280.

Here’s some more info on Thrust sorting performance.

Can the thrust sort by key directly perform on device memory? I met some problem when doing this. And I am debugging.

This might have some improvment.

But eventually the result should copy back to cpu memory.

Using reduction multi times actually can find the top Nth element, which is what I need.

Are you sure about that? A standard partial reduction will give you a set of candidate values containing the global maximum, but there is no guarantee that any of the other candidate values in that set have any order or relation to the global maximum. You need to include some ordering and/or merging somewhere to ensure that.

I was just thinking about this the other day as well.
Im still a novice in cuda but here are some random thoughts…

  • Like avidday said, if your data is not on the device, the CPU might be faster and sort at O(kn). (radix/bucket radix)
  • for my app, I was thinking of a modified radix sort that tossed out keys in the lower order bits. (I have not built anything yet. )
  • also for my app, I am able to zip the key and value together in a 32 bits word. I put the key in the higher order bits and sort with that.
  • Sorting the entire array just to get the top N values seems wasteful. (I guess it depends how big N is.)
  • If N is small (1,2,3,4) it might be faster to just get Max() then save that value and set the original to zero and find Max() again and again.

I sure know about that. I said multiple times reduction. After found the maximum, I set the data to a really small value and run reduction again and find the second biggest one.

Maybe I am not clearly state my problem. The key-value data are currently on device memory. I need to sort them and got the Nth biggest index,and copy the index to cpu and do other stuff.

Actually I don’t need the step of data copy to device memory.

How do you Excluding transfers? I just do as the

thrust::sorting::radix_sort_by_key(magnd, magnd + N, listd);

You are quite right. If the sort time is merely less. I would rather have it sorted though.