Weird performance hiccups with Thrust

Does anyone have an idea why the following code might have performance hiccups (from 0.3ms to 1.8ms)?
Attached is the profiler output. I run the code below in a loop (in a simple test, without streams/threads, and using Thrust's caching allocator), and from time to time there are big hiccups.

```cpp
int runs = 1500;
for (int l = 0; l < runs; l++)
{
    typedef thrust::device_vector<int>::iterator OutputIterator;
    OutputIterator output_end_iterator = thrust::copy_if(
        thrust::make_counting_iterator((int)(sz * percentage)),
        [=] __host__ __device__ (int index) { return (0 == index % 3); }); // copy_r_channel_from_rgb{});
    thrust::sort(thrust::cuda::par(alloc), d_resvec.begin(), d_resvec.end());
}
```
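The `copy_if` call above is missing some of its arguments, so here is a hedged, self-contained reconstruction of the loop (`sz`, `percentage`, the vector size, and the choice of `copy_if` overload are my assumptions, not the poster's actual code; the caching allocator is left out for brevity). Compile with `nvcc --extended-lambda`.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
#include <thrust/iterator/counting_iterator.h>

int main()
{
    const int   sz         = 1 << 20;   // assumed problem size
    const float percentage = 0.5f;      // assumed fraction

    thrust::device_vector<int> d_resvec(sz);

    const int runs = 1500;
    for (int l = 0; l < runs; l++)
    {
        // Copy every third index in [0, sz * percentage) into d_resvec.
        auto output_end_iterator = thrust::copy_if(
            thrust::make_counting_iterator(0),
            thrust::make_counting_iterator((int)(sz * percentage)),
            d_resvec.begin(),
            [=] __host__ __device__ (int index) { return 0 == index % 3; });

        // Sort only the portion that copy_if actually wrote.
        thrust::sort(d_resvec.begin(), output_end_iterator);
    }
    return 0;
}
```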

I would assume the caching allocator has hiccups from time to time. Certainly at the first call, at least.
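To illustrate why the first call in particular is expensive: a minimal caching allocator (a sketch modeled on Thrust's `custom_temporary_allocation` example, not necessarily what the poster's `alloc` does) only pays the `cudaMalloc` cost on a cache miss:

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <map>

struct cached_allocator
{
    typedef char value_type;

    std::multimap<std::ptrdiff_t, char*> free_blocks;  // size -> cached pointer
    std::map<char*, std::ptrdiff_t>      allocated;    // pointer -> size

    char* allocate(std::ptrdiff_t num_bytes)
    {
        char* result;
        auto it = free_blocks.lower_bound(num_bytes);
        if (it != free_blocks.end()) {
            // Cache hit: reuse a previously freed block, no cudaMalloc.
            result    = it->second;
            num_bytes = it->first;
            free_blocks.erase(it);
        } else {
            // Cache miss (e.g. the very first call): cudaMalloc can
            // synchronize the device and show up as a hiccup.
            cudaMalloc(&result, num_bytes);
        }
        allocated[result] = num_bytes;
        return result;
    }

    void deallocate(char* ptr, size_t)
    {
        // Return the block to the cache instead of calling cudaFree.
        std::ptrdiff_t num_bytes = allocated[ptr];
        allocated.erase(ptr);
        free_blocks.insert({num_bytes, ptr});
    }
};
```

Passing such an object via `thrust::cuda::par(alloc)` routes the algorithm's temporary allocations through it, so only the first iteration (or the first larger size) should pay the allocation cost.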

Please use proper code formatting. I also suggest providing complete code.

Hi Robert,
It happens from time to time during the run… The hiccups also happen without the caching allocator. (5.4 KB)

Attached is updated code with NVTX instrumentation and a portion of the NVVP output.
Any idea why it would behave like this?
I’m using CUDA 11.3 and a GTX 1660 Super with Ubuntu 20.
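For reference, NVTX instrumentation of the kind mentioned above (a sketch; the actual attached code may differ) just brackets each Thrust call with a named range so it shows up on the NVVP/Nsight timeline:

```cuda
#include <nvToolsExt.h>              // link with -lnvToolsExt
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sort_with_range(thrust::device_vector<int>& d_resvec)
{
    nvtxRangePushA("thrust::sort");  // opens a named range on the timeline
    thrust::sort(d_resvec.begin(), d_resvec.end());
    nvtxRangePop();                  // closes the innermost open range
}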

Eyal (5.7 KB)

Another run (without the event recording)…
How can the cudaGetDevice call take 187us out of a total of 305us? Previous calls took ~120us, which is exactly 305us - 187us.
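One way to cross-check the profiler numbers (a sketch I am suggesting, not code from the attachments) is to time the suspect runtime-API call on the host with `std::chrono`:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    int dev = 0;
    cudaGetDevice(&dev);   // first call may also initialize the context

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaGetDevice(&dev);   // the call under suspicion
    auto t1 = std::chrono::high_resolution_clock::now();

    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("cudaGetDevice took %lld us\n", us);
    return 0;
}
```

If the host-side measurement disagrees with the profiler, the extra time may be profiler overhead rather than the call itself.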

And here the cudaGetLastError takes so much time

Hi, any idea what might be causing behaviour such as demonstrated in the last couple of screenshots?