Calling thrust::sort_by_key from kernel

I am working with the driver API and need to sort an array of 64-bit integers, for which I opted to use Thrust. But how does it work internally, and how do I invoke it properly from a kernel?

  1. Does it use dynamic parallelism, e.g. invoking another kernel internally? Or should I distribute the computation somehow? If there is a kernel call inside the sort, are the block/grid parameters set automatically, or is it up to me to set them?

  2. Should it be invoked like this?

    thrust::sort_by_key(thrust::device, keys, keys + len, values);


The execution policy determines how the algorithm is executed.

  1. If you use thrust::device, and you have compiled appropriately to support dynamic parallelism, and you are running on a GPU that supports dynamic parallelism (compute capability 3.5 or higher), then it will launch a kernel to perform the algorithm, via dynamic parallelism. If these conditions are not met, the behavior will be as in item 2 below.

  2. If you specify thrust::seq, the algorithm will be executed sequentially, entirely within the context of the launching (CUDA device) thread.
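To make the two cases concrete, here is a minimal sketch of calling thrust::sort_by_key from device code with either policy. The kernel name and the single-thread guard are my own illustration, not anything prescribed by Thrust:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Hypothetical kernel: sorts keys/values in place from device code.
// With thrust::device, the call may launch a child kernel via dynamic
// parallelism (the child grid/block configuration is chosen by Thrust,
// not by you).
__global__ void sort_kernel(unsigned long long *keys, int *values, int len)
{
    // Guard so only one thread issues the sort; otherwise every thread
    // would launch its own redundant sort of the same data.
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        thrust::sort_by_key(thrust::device, keys, keys + len, values);

        // Alternative: run the sort sequentially within this single
        // device thread, with no dynamic parallelism required:
        // thrust::sort_by_key(thrust::seq, keys, keys + len, values);
    }
}
```

For the thrust::device path you must compile with relocatable device code and link the device runtime, e.g. something like `nvcc -arch=sm_35 -rdc=true -lcudadevrt`; the thrust::seq path needs no special flags.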

My guess is that if you are only sorting 64 elements, executing the sort with thrust::seq will be faster in any case. A dynamic parallelism kernel launch made this way has all the characteristics of dynamic parallelism kernel launches that you make explicitly, including setup overhead and latency.

Here is a worked example and description of using thrust::sort from Thrust code:

If you study how it is called from the functor there, you should be able to work out how to call it from ordinary CUDA device code.
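The functor pattern referred to above looks roughly like the following sketch. The functor name and the segment layout are my own assumptions for illustration; the key point is that thrust::sort_by_key with thrust::seq is called inside a __device__ operator(), and the same call works verbatim inside an ordinary __global__ function:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Hypothetical functor: each invocation sequentially sorts one
// fixed-size segment of keys (with matching values) using thrust::seq,
// entirely within the calling device thread.
struct sort_segment
{
    unsigned long long *keys;
    int *values;
    int len;  // elements per segment

    __device__ void operator()(int seg) const
    {
        thrust::sort_by_key(thrust::seq,
                            keys + seg * len,
                            keys + (seg + 1) * len,
                            values + seg * len);
    }
};
```

Such a functor can be driven from the host with thrust::for_each over a sequence of segment indices, giving one independent sequential sort per device thread.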