Why is cudaStreamSynchronize() taking so long even with batched GPU→CPU copies, and how can I profile what in the stream queue is causing the delay?

I’m seeing unexpectedly long cudaStreamSynchronize() times even though I only do GPU→CPU (DtoH) copies once per batch, not once per inference, and I’m trying to understand what is actually blocking the stream.

My assumption was that batching the memcpy_dtoh_async calls would reduce sync overhead, but the synchronize still takes a long time. That makes me think the delay is not the copy call itself, but earlier work queued in the same stream (kernels, implicit synchronization points, or previous memcpy operations) whose cost only becomes visible when I call cudaStreamSynchronize(). Since operations in a CUDA stream execute strictly in order, I suspect a backlog is accumulating in the queue and only surfacing at sync time, but I’m not sure how to pinpoint which operation is responsible.

Is there a reliable way (e.g., the CUDA profiler, Nsight Systems, or CUDA event timing) to break down the stream and identify exactly which kernel or memcpy in the queue is causing the stall before the sync?
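
For context, the sketch below is roughly what I had in mind for the event-timing approach, assuming PyCUDA (which is what I’m using for the async copies). `run_batch_kernels`, `host_out`, and `dev_out` are placeholder names, not my real code:

```python
# Minimal sketch: bracket each queued operation with CUDA events so the
# time spent in each segment of the stream can be attributed separately.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a context on the default device

stream = cuda.Stream()

# Placeholder buffers for the example; real code uses the per-batch output.
n = 1 << 20
host_out = cuda.pagelocked_empty(n, np.float32)  # pinned, so the DtoH copy can overlap
dev_out = cuda.mem_alloc(host_out.nbytes)

# One event before and after each piece of queued work I want to measure.
ev_start = cuda.Event()
ev_after_kernels = cuda.Event()
ev_after_copy = cuda.Event()

ev_start.record(stream)
# run_batch_kernels(stream)   # placeholder: enqueue the per-batch inference kernels
ev_after_kernels.record(stream)

cuda.memcpy_dtoh_async(host_out, dev_out, stream)  # the batched DtoH copy
ev_after_copy.record(stream)

stream.synchronize()  # same sync point as before, now bracketed by events

print("kernels:   %.3f ms" % ev_start.time_till(ev_after_kernels))
print("DtoH copy: %.3f ms" % ev_after_kernels.time_till(ev_after_copy))
```

The idea would be that whichever interval reports the large time is the real culprit, rather than the synchronize call itself. Is this kind of per-segment event timing reliable for attributing the stall, or does it miss things like implicit sync points, so that a timeline tool like Nsight Systems is the only way to see what is really queued in the stream?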