cub::DeviceRadixSort::SortKeys concurrent with kernel?

I’m sorting a device vector using cub::DeviceRadixSort::SortKeys.
Before the sort, I prepare the input vector (the vector to be sorted) by calling another kernel. I found that I have to add a cudaMemcpy (or cudaDeviceSynchronize()) between the kernel and the sort to get correct results.
Why? Does SortKeys run on a different stream?

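Not in my experience. SortKeys launches its work on the stream passed in its stream argument, which defaults to stream 0.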

You can confirm the streams used by the various activities with a profiler.
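
For reference, here is a minimal sketch of the pattern described in the question, with the kernel and the sort issued into the same explicitly created stream and no synchronization between them. The kernel `fill_keys`, the helper `sort_after_kernel`, and the key values are hypothetical stand-ins, not the original poster's code. If a program is structured like this and still needs a cudaDeviceSynchronize() to produce correct results, the cause is probably elsewhere (for example, the kernel and the sort being issued into different non-default streams).

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical kernel standing in for the one that prepares the input:
// writes the values n, n-1, ..., 1 so the input is in descending order.
__global__ void fill_keys(unsigned int* keys, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) keys[i] = static_cast<unsigned int>(n - i);
}

// Hypothetical helper: runs the kernel and the sort back to back in one
// stream, then returns true if the output came back as 1..n in order.
bool sort_after_kernel(int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    unsigned int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(unsigned int));
    cudaMalloc(&d_out, n * sizeof(unsigned int));

    // With a null temp-storage pointer, SortKeys only reports the required
    // size in temp_bytes and performs no sorting.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n,
                                   0, 32, stream);
    cudaMalloc(&d_temp, temp_bytes);

    // Kernel and sort are issued into the same stream, so the sort is
    // ordered after the kernel without any explicit synchronization.
    fill_keys<<<(n + 255) / 256, 256, 0, stream>>>(d_in, n);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n,
                                   0, 32, stream);

    // Copy back in the same stream, then wait once for everything.
    std::vector<unsigned int> h(n);
    cudaMemcpyAsync(h.data(), d_out, n * sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaFree(d_temp);
    cudaFree(d_out);
    cudaFree(d_in);
    cudaStreamDestroy(stream);

    return ok && std::is_sorted(h.begin(), h.end()) &&
           h.front() == 1u && h.back() == static_cast<unsigned int>(n);
}

int main() {
    bool ok = sort_after_kernel(1 << 20);
    std::printf("%s\n", ok ? "sorted" : "NOT sorted");
    return ok ? 0 : 1;
}
```

Running this under a profiler such as Nsight Systems should show the kernels launched by SortKeys on the same stream as `fill_keys`, which is one way to do the check suggested above.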