Why is cudaStreamSynchronize() taking so long even with batched GPU→CPU copies, and how can I profile what in the stream queue is causing the delay?

I’m seeing unexpectedly long cudaStreamSynchronize() times even though I only do GPU→CPU (DtoH) copies once per batch, not once per inference, and I’m trying to understand what is actually blocking the stream.

My assumption was that batching the memcpy_dtoh_async calls would reduce sync overhead, but the synchronize still takes a long time. That makes me think the delay is not the copy call itself, but earlier work queued in the same stream (kernels, implicit synchronization points, or previous memcpy operations) whose cost only becomes visible when I call cudaStreamSynchronize(). Since operations in a CUDA stream execute strictly in order, I suspect a backlog is accumulating in the queue and only surfacing at sync time, but I’m not sure how to pinpoint which operation is responsible.

Is there a reliable way (e.g., the CUDA profiler, Nsight Systems, or CUDA event timing) to break down the stream and identify exactly which kernel or memcpy in the queue is causing the stall before the sync?
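
For context, the sketch below is roughly what I had in mind for the event-timing approach, assuming PyCUDA (which is what I’m using for the async copies). `run_batch_kernels`, `host_out`, and `dev_out` are placeholder names, not my real code:

```python
# Minimal sketch: bracket each queued operation with CUDA events so the
# time spent in each segment of the stream can be attributed separately.
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a context on the default device

stream = cuda.Stream()

# Placeholder buffers for the example; real code uses the per-batch output.
n = 1 << 20
host_out = cuda.pagelocked_empty(n, np.float32)  # pinned, so the DtoH copy can overlap
dev_out = cuda.mem_alloc(host_out.nbytes)

# One event before and after each piece of queued work I want to measure.
ev_start = cuda.Event()
ev_after_kernels = cuda.Event()
ev_after_copy = cuda.Event()

ev_start.record(stream)
# run_batch_kernels(stream)   # placeholder: enqueue the per-batch inference kernels
ev_after_kernels.record(stream)

cuda.memcpy_dtoh_async(host_out, dev_out, stream)  # the batched DtoH copy
ev_after_copy.record(stream)

stream.synchronize()  # same sync point as before, now bracketed by events

print("kernels:   %.3f ms" % ev_start.time_till(ev_after_kernels))
print("DtoH copy: %.3f ms" % ev_after_kernels.time_till(ev_after_copy))
```

The idea would be that whichever interval reports the large time is the real culprit, rather than the synchronize call itself. Is this kind of per-segment event timing reliable for attributing the stall, or does it miss things like implicit sync points, so that a timeline tool like Nsight Systems is the only way to see what is really queued in the stream?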