I’m sorting a device vector using sortKeys.
I prepare the input vector (the vector to be sorted) by calling another kernel before the sort operation. I found that I have to add a cudaMemcpy (or cudaDeviceSynchronize()) between the kernel and the sort to get correct results.
Why? Does cub::sortKeys run on a different stream?
Not in my experience.
You can confirm the streams used by the various activities with a profiler.
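For what it's worth, here is a minimal sketch of what I would expect to work without any extra synchronization. It assumes "sortKeys" means cub::DeviceRadixSort::SortKeys, and the kernel fill_descending and the helper sort_after_kernel are hypothetical names standing in for your own code. The kernel and both SortKeys calls are issued to the same stream, so stream ordering should make the sort see the kernel's output. Note that the first SortKeys call (with a null temp-storage pointer) only queries the scratch size and does not sort anything.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the kernel that prepares the keys.
__global__ void fill_descending(unsigned int* keys, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) keys[i] = static_cast<unsigned int>(n - i);
}

// Hypothetical helper: fills n keys on the device, sorts them with CUB on the
// same stream, and returns true if the result comes back sorted.
bool sort_after_kernel(int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    unsigned int* d_in = nullptr;
    unsigned int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(unsigned int));
    cudaMalloc(&d_out, n * sizeof(unsigned int));

    // Producer kernel, launched on `stream`.
    fill_descending<<<(n + 255) / 256, 256, 0, stream>>>(d_in, n);

    // First call with a null temp pointer: size query only, no sorting.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n,
                                   0, 32, stream);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call does the sort, on the same stream as the kernel.
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n,
                                   0, 32, stream);

    // Copy back on the same stream, then wait for everything to finish.
    std::vector<unsigned int> h(n);
    cudaMemcpyAsync(h.data(), d_out, n * sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    bool ok = (cudaGetLastError() == cudaSuccess) &&
              std::is_sorted(h.begin(), h.end());

    cudaFree(d_temp);
    cudaFree(d_out);
    cudaFree(d_in);
    cudaStreamDestroy(stream);
    return ok;
}
```

If something like this gives correct results for you but your real code does not, I'd compare which stream each launch uses and check the return codes of every call.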