The execution policy determines how the algorithm is executed.
If you use thrust::device, and you have compiled with support for dynamic parallelism, and you are running on a GPU that supports dynamic parallelism (compute capability 3.5 or higher), then the algorithm will launch a child kernel via dynamic parallelism to do the work. If those conditions are not met, the behavior falls back to the thrust::seq case described next.
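For reference, a dynamic-parallelism build typically requires relocatable device code and linking against the device runtime; a sketch of such a compile line follows (the file name `app.cu` is just a placeholder, and your target architecture may differ):

```shell
# Build with relocatable device code (-rdc=true) and link the CUDA
# device runtime, targeting a dynamic-parallelism-capable arch (cc 3.5+).
nvcc -arch=sm_35 -rdc=true app.cu -o app -lcudadevrt
```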
If you specify thrust::seq, the algorithm will be executed sequentially, entirely within the context of the launching (CUDA device) thread.
My guess would be that if you are only sorting 64 elements, executing it with thrust::seq will be faster in any case. A kernel launched this way via dynamic parallelism has all the characteristics of a dynamic-parallelism kernel launch you make explicitly, including setup overhead and launch latency.
Here is a worked example and description of calling thrust::sort from within Thrust code (i.e., from a functor passed to another Thrust algorithm). If you study how it is called from the functor, you should be able to work out how to do it from ordinary CUDA device code.
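To illustrate the thrust::seq path, here is a minimal sketch of calling thrust::sort from ordinary CUDA device code; the kernel name `sort_kernel` and the 64-element, single-thread launch are just illustrative choices. With thrust::seq, the sort runs sequentially in the calling device thread, so no dynamic-parallelism compile flags are needed:

```cuda
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cstdio>

__global__ void sort_kernel(int *data, int n)
{
    // thrust::seq runs the sort sequentially within this device thread;
    // no child kernel is launched.
    thrust::sort(thrust::seq, data, data + n);
}

int main()
{
    const int n = 64;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = n - i;   // descending input

    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    sort_kernel<<<1, 1>>>(d, n);   // one thread does the whole sort
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("first: %d  last: %d\n", h[0], h[n - 1]);
    cudaFree(d);
    return 0;
}
```

Switching the policy to thrust::device in the same call site is what would engage the dynamic-parallelism path, given the build and hardware requirements above.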