Asymptotic cost of CUB sort

Good afternoon,

I have a question about sorting performance with the CUB library that I hope someone can answer. Does anyone know the lower-bound cost of a CUB sort, either comparison-based or non-comparison-based (e.g. radix sort)? I know that serial comparison-based sorts have a lower bound of n*lg(n), but I assume the bound for parallelized CUDA CUB sorts should be lower, maybe (n/p)*lg(n), where p is the number of parallel processing elements?
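To make the guess concrete, here is a rough Python sketch of the cost models I have in mind. These are idealized operation counts only, not measurements of CUB: the function names are my own, the even split of work over p is an assumption, and the radix model assumes a simple least-significant-digit scheme with a fixed key width and digit size.

```python
import math

def serial_comparison_cost(n):
    """Serial comparison-sort bound: about n * lg(n) comparisons."""
    return n * math.log2(n)

def parallel_comparison_cost(n, p):
    """The guess from this post: the same work split evenly over p
    processing elements (assumption: perfect load balance, no overhead)."""
    return (n / p) * math.log2(n)

def parallel_radix_cost(n, p, key_bits=32, bits_per_pass=8):
    """Idealized LSD radix sort: one pass over n/p keys per digit
    (assumption: key_bits and bits_per_pass are illustrative values)."""
    passes = math.ceil(key_bits / bits_per_pass)
    return passes * (n / p)

if __name__ == "__main__":
    n, p = 1 << 20, 1024
    print("serial comparison:  ", serial_comparison_cost(n))
    print("parallel comparison:", parallel_comparison_cost(n, p))
    print("parallel radix:     ", parallel_radix_cost(n, p))
```

If the (n/p)*lg(n) guess is right, the comparison model should shrink linearly with p, while the radix model depends on the key width rather than on lg(n).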

Thanks,