A question about Radix Sort

It seems that the fastest sorting algorithm in CUDA is the radix sort. However, after break down the runtime of different components in Radix Sort, I found, the cudppScanDispatch consumes 99% of the time. Why? Isn’t cudppScanDispatch supposed to be with complexity O(n)?

Thanks!