cudaMemcpy2DAsync long latency

Yaro · June 30, 2013, 2:27pm

Hello,

I just wrote a program that launches a kernel many times in a for loop. In every iteration I copy the necessary data from host to device. Everything is issued in the same stream (not the default one).
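For readers without the original source, the setup described above might look roughly like the sketch below. The kernel `scale`, the helper `run_iterations`, the 1024x1024 size, and the use of a page-locked host buffer from `cudaMallocHost` are all assumptions made for illustration; the post does not say how the host memory was allocated.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Return the first CUDA error to the caller (cleanup on failure omitted for brevity).
#define CHECK(call)                           \
    do {                                      \
        cudaError_t e_ = (call);              \
        if (e_ != cudaSuccess) return e_;     \
    } while (0)

// Hypothetical kernel: scales every element of a pitched 2D array in place.
__global__ void scale(float* data, size_t pitch, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Rows are `pitch` bytes apart, so the row offset is computed in bytes.
        float* row = (float*)((char*)data + y * pitch);
        row[x] *= factor;
    }
}

// Issues `iterations` copy + kernel pairs into one non-default stream.
cudaError_t run_iterations(int iterations)
{
    const int width = 1024;                        // assumed problem size
    const int height = 1024;
    const size_t rowBytes = width * sizeof(float);

    float* h_data = nullptr;
    float* d_data = nullptr;
    size_t pitch = 0;
    cudaStream_t stream;

    // Assumption: page-locked host memory, not stated in the post.
    CHECK(cudaMallocHost((void**)&h_data, rowBytes * height));
    memset(h_data, 0, rowBytes * height);
    CHECK(cudaMallocPitch((void**)&d_data, &pitch, rowBytes, height));
    CHECK(cudaStreamCreate(&stream));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    for (int i = 0; i < iterations; ++i) {
        // Host-to-device copy and kernel launch, both queued in the same stream.
        CHECK(cudaMemcpy2DAsync(d_data, pitch, h_data, rowBytes, rowBytes, height,
                                cudaMemcpyHostToDevice, stream));
        scale<<<grid, block, 0, stream>>>(d_data, pitch, width, height, 2.0f);
        CHECK(cudaGetLastError());
    }

    CHECK(cudaStreamSynchronize(stream));          // wait for all queued work
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return cudaSuccess;
}

int main()
{
    cudaError_t err = run_iterations(100);
    printf("%s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Because the copy and the kernel share one stream, the GPU executes them strictly in order; the host only queues the work and, in principle, is free to run ahead.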

While profiling the program I noticed that the call to cudaMemcpy2DAsync usually shows a very long latency, except in the first two iterations.
Why is that? Can I reduce the latency without using multiple streams?

The profiler output is attached.

Yaro · June 30, 2013, 3:38pm

I thought my CPU might not be fast enough to issue all the calls in time, so I made the kernel run much longer. The effect was that the latencies became even bigger!
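One way to check whether the long duration is spent on the host side of the call, rather than in the transfer itself, is to time each cudaMemcpy2DAsync call with a CPU clock. The sketch below does that; the kernel `busy`, the helper `time_copies`, the `spin` parameter, and the buffer sizes are hypothetical, and the page-locked host buffer is again an assumption. If the measured times jump after the first few iterations and grow with `spin`, that would be consistent with the host waiting on work already queued in the stream, but this is something to test rather than a diagnosis.

```cuda
#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define CHECK(call)                           \
    do {                                      \
        cudaError_t e_ = (call);              \
        if (e_ != cudaSuccess) return e_;     \
    } while (0)

// Hypothetical stand-in for a long kernel: each thread loops `spin` times.
__global__ void busy(float* data, size_t pitch, int width, int height, int spin)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float* row = (float*)((char*)data + y * pitch);
        float v = row[x];
        for (int i = 0; i < spin; ++i) v = v * 1.0001f + 0.5f;
        row[x] = v;   // store the result so the loop is not optimized away
    }
}

// Prints how long the host spends inside each cudaMemcpy2DAsync call.
cudaError_t time_copies(int iterations, int spin)
{
    const int width = 1024, height = 1024;         // assumed problem size
    const size_t rowBytes = width * sizeof(float);
    float* h_data = nullptr;
    float* d_data = nullptr;
    size_t pitch = 0;
    cudaStream_t stream;

    CHECK(cudaMallocHost((void**)&h_data, rowBytes * height));  // assumption: pinned
    memset(h_data, 0, rowBytes * height);
    CHECK(cudaMallocPitch((void**)&d_data, &pitch, rowBytes, height));
    CHECK(cudaStreamCreate(&stream));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    for (int i = 0; i < iterations; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        CHECK(cudaMemcpy2DAsync(d_data, pitch, h_data, rowBytes, rowBytes, height,
                                cudaMemcpyHostToDevice, stream));
        auto t1 = std::chrono::steady_clock::now();
        busy<<<grid, block, 0, stream>>>(d_data, pitch, width, height, spin);
        CHECK(cudaGetLastError());

        // Host wall-clock time until the API call returned, not GPU transfer time.
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("iteration %2d: cudaMemcpy2DAsync returned after %.1f us\n", i, us);
    }

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    CHECK(cudaFreeHost(h_data));
    return cudaSuccess;
}

int main()
{
    cudaError_t err = time_copies(20, 10000);
    printf("%s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

Running it with a small and a large `spin` value and comparing the printed times mirrors the experiment described above, where lengthening the kernel made the latencies bigger.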
