cudaMemcpyAsync takes too long


In my program I upload 5 frames of image data (about 33.1 MB per frame) in succession from host to device, using cudaMemcpyAsync on the correct CUDA stream.
I profiled the program with Visual Profiler and found that the execution time of the cudaMemcpyAsync call can be very long, which seriously affects my program's performance.
I am sure the host memory is pinned. The call's execution time is sometimes abnormally long and sometimes very short.

Code that uploads one frame:

bool NvcCUDAMemoryCopy(ENvcCUDAMemcpyDirection eDirection, unsigned __int64 ui64Stream, void* pSrcMemory, void* pDstMemory, unsigned __int64 ui64MemorySize, bool bSync)
{
	cudaMemcpyKind emKind = cudaMemcpyDefault;
	cudaError_t emRet = cudaSuccess;

	// Map the wrapper's direction enum onto the CUDA copy kind.
	switch (eDirection)
	{
	case keNvcCUDAMemcpyDirection_Host_To_Device:
		emKind = cudaMemcpyHostToDevice;
		break;
	case keNvcCUDAMemcpyDirection_Devcie_To_Host:
		emKind = cudaMemcpyDeviceToHost;
		break;
	case keNvcCUDAMemcpyDirection_Devcie_To_Devcie:
		emKind = cudaMemcpyDeviceToDevice;
		break;
	default:
		assert(false);
		return false;
	}

	// Blocking copy when bSync is set, otherwise enqueue the copy on the given stream.
	if (bSync)
		emRet = cudaMemcpy(pDstMemory, pSrcMemory, ui64MemorySize, emKind);
	else
		emRet = cudaMemcpyAsync(pDstMemory, pSrcMemory, ui64MemorySize, emKind, (cudaStream_t)ui64Stream);

	assert(emRet == cudaSuccess);

	return (emRet == cudaSuccess);
}


Visual Profiler: a record showing an abnormally long execution time for cudaMemcpyAsync

cudaMemcpyAsync (host API call)
Start: 55.86506 s
End: 55.87769 s
Duration: 12.61291 ms

Memory copy
Description: Memcpy HtoD [async]
Start: 55.8778 s
End: 55.88082 s
Duration: 3.01465 ms
Size: 33.181 MB
Throughput: 11.007 GB/s
Stream: stream 631
Memory type: source pinned, destination device

My environment:

Windows 10, CUDA 10.1, GTX 1080 Ti, C++


cudaMemcpyTooLongTime.rar (1.58 MB)

That’s pretty much the throughput I would expect.

Hello Robert_Crovella:

I know the throughput is fine.
But the execution time of the cudaMemcpyAsync call on the host is too long:
start 55.86506s
end 55.87769s
duration 12.61291ms

cudaMemcpyAsync is an asynchronous function, so 12 ms seems too long.

cudaMemcpyAsync, like any other stream-based activity in CUDA, is issued into a stream. The function will not begin to execute until all previous activity in that stream has completed. The profiler reports the time that the function was issued into the stream, and the time that it completes, as its duration.

But this is a meaningless judge of performance. If the cudaMemcpyAsync function is waiting for previous activity to complete, that has nothing to do with its performance (i.e. duration).

There’s simply not enough information in the little snippet you have posted to make any judgement. But from what I can tell, it got issued, it may have waited for a while, and eventually it executed. When it executed, it achieved a throughput of ~11GB/s which is quite reasonable for a PCIE gen3 link.

There isn’t anything that looks out of the ordinary to me. You might just as well complain about the “performance” of cudaDeviceSynchronize issued after a kernel launch. Its duration will include the duration of the kernel execution. That is meaningless from a “performance” standpoint.
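
To make the point about in-order streams concrete, here is a minimal host-only C++ sketch (no CUDA calls involved). It models a single stream in which each operation starts only after the previous one has finished, and it reports both the execution time and the issue-to-completion time of each operation. The names StreamOp, OpTiming, and simulateStream are hypothetical, and any timings fed into it are made-up inputs, not profiler data.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of work issued into one in-order stream.
// This is an illustration only, not part of the CUDA API.
struct StreamOp {
    double issueMs;  // when the host enqueued the operation
    double execMs;   // how long the operation itself runs once it starts
};

struct OpTiming {
    double startMs;       // when the operation actually begins executing
    double endMs;         // when it finishes
    double issueToEndMs;  // issue -> completion, including time spent waiting
};

// Operations must be listed in issue order. Each one starts at the later of
// its issue time and the completion time of the previous operation.
std::vector<OpTiming> simulateStream(const std::vector<StreamOp>& ops) {
    std::vector<OpTiming> timings;
    double prevEndMs = 0.0;
    for (const StreamOp& op : ops) {
        const double startMs = std::max(op.issueMs, prevEndMs);
        const double endMs = startMs + op.execMs;
        timings.push_back({startMs, endMs, endMs - op.issueMs});
        prevEndMs = endMs;
    }
    return timings;
}
```

For example, with an assumed 10 ms kernel issued at t = 0 and a 3 ms copy issued at t = 1 ms, simulateStream({{0.0, 10.0}, {1.0, 3.0}}) gives the copy a start of 10 ms, an end of 13 ms, and an issue-to-completion time of 12 ms, even though the transfer itself ran for only 3 ms.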

Hello Robert_Crovella:

Thank you for your reply.

Can I send you an e-mail to show you what happened? I don’t know how to attach a picture in the forums.

I have the Visual Profiler file, which shows how important the “performance” of cudaDeviceSynchronize is.

To post a picture using this site directly, edit your post, and in the edit toolbar at the top of the edit window, select the button that looks like a chart picture - the button just to the left of the code button </>. Then follow the directions (e.g. select the picture to upload, etc.)

Another approach is to manually encode the link to a picture using text like this:

[ img]http_link_to_picture[ /img]

(get rid of the spaces in the brackets above)

You may also want to read this answer: