cudaMemcpyAsync call takes too long

sun_jianchuan · July 13, 2019, 6:21am

Hello NVIDIA:

In my program, I upload 5 images (about 33.1 MB per frame) in succession from host to device, using cudaMemcpyAsync with the correct CUDA stream.
I profiled the program with the Visual Profiler and found that the host-side execution time of the cudaMemcpyAsync call can be very long, which seriously hurts my program's performance.
I am sure the host memory is pinned. The execution time of cudaMemcpyAsync is sometimes abnormally long and sometimes very short.

Code that uploads one frame:

#include <cassert>
#include <cuda_runtime.h>

// Copies ui64MemorySize bytes from pSrcMemory to pDstMemory.
// ENvcCUDAMemcpyDirection is defined elsewhere in the project.
// If bSync is true, the blocking cudaMemcpy is used; otherwise the copy
// is enqueued on the stream passed in ui64Stream.
bool NvcCUDAMemoryCopy(ENvcCUDAMemcpyDirection eDirection, unsigned __int64 ui64Stream,
                       void* pSrcMemory, void* pDstMemory,
                       unsigned __int64 ui64MemorySize, bool bSync)
{
    cudaMemcpyKind emKind = cudaMemcpyDefault;
    cudaError_t emRet = cudaSuccess;

    // Map the project's direction enum onto the CUDA copy kind.
    switch (eDirection)
    {
    case keNvcCUDAMemcpyDirection_Host_To_Device:
        emKind = cudaMemcpyHostToDevice;
        break;
    case keNvcCUDAMemcpyDirection_Devcie_To_Host:
        emKind = cudaMemcpyDeviceToHost;
        break;
    case keNvcCUDAMemcpyDirection_Devcie_To_Devcie:
        emKind = cudaMemcpyDeviceToDevice;
        break;
    default:
        assert(false);
        return false;
    }

    if (bSync)
        emRet = cudaMemcpy(pDstMemory, pSrcMemory, ui64MemorySize, emKind);
    else
        emRet = cudaMemcpyAsync(pDstMemory, pSrcMemory, ui64MemorySize, emKind, (cudaStream_t)ui64Stream);

    assert(emRet == cudaSuccess);

    return (emRet == cudaSuccess);
}

Visual Profiler: a record showing an abnormally long execution time for cudaMemcpyAsync

cudaMemcpyAsync
  start: 55.86506 s
  end: 55.87769 s
  duration: 12.61291 ms
memory copy
  description: Memcpy HtoD [async]
  start: 55.8778 s
  end: 55.88082 s
  duration: 3.01465 ms
  size: 33.181 MB
  throughput: 11.007 GB/s
  stream: stream 631
  memory type
    source: pinned
    destination: device

My environment:

Windows 10, CUDA 10.1, GTX 1080 Ti, C++

Thank you!

cudaMemcpyTooLongTime.rar (1.58 MB)

Robert_Crovella · July 14, 2019, 4:06am

That’s pretty much the throughput I would expect.

sun_jianchuan · July 15, 2019, 3:25am

Hello Robert_Crovella:

I know that the throughput is fine.
But the host-side execution time of cudaMemcpyAsync is too long.
cudaMemcpyAsync
  start: 55.86506 s
  end: 55.87769 s
  duration: 12.61291 ms

cudaMemcpyAsync is an asynchronous function, so 12 ms seems too long.

Robert_Crovella · July 15, 2019, 3:50am

cudaMemcpyAsync, like any other stream-based activity in CUDA, is issued into a stream. The function will not begin to execute until all previous activity in that stream has completed. The profiler reports the time that the function was issued into the stream, and the time that it completes, as its duration.

But this is a meaningless measure of performance. If the cudaMemcpyAsync function is waiting for previous activity to complete, that has nothing to do with its performance (i.e. duration).

There’s simply not enough information in the little snippet you have posted to make any judgement. But from what I can tell, it got issued, it may have waited for a while, and eventually it executed. When it executed, it achieved a throughput of ~11GB/s which is quite reasonable for a PCIe Gen3 link.

There isn’t anything that looks out of the ordinary to me. You might just as well complain about the “performance” of cudaDeviceSynchronize issued after a kernel launch. Its duration will include the duration of the kernel execution. That is meaningless from a “performance” standpoint.

sun_jianchuan · July 15, 2019, 5:56am

Hello Robert_Crovella:

Thank you for your reply.

Can I send you an e-mail to show you what happened? I don’t know how to attach a picture in the forums.

I have a Visual Profiler file that shows how important the “performance” of cudaDeviceSynchronize is.

Robert_Crovella · July 15, 2019, 2:09pm

To post a picture using this site directly, edit your post, and in the edit toolbar at the top of the edit window, select the button that looks like a chart picture - the button just to the left of the code button </>. Then follow the directions (e.g. select the picture to upload, etc.)

Another approach is to manually encode the link to a picture using text like this:

[ img]http_link_to_picture[ /img]

(get rid of the spaces in the brackets above)

You may also want to read this answer:

Where is the boundary of start and end of CPU launch and GPU launch of Nvidia Profiling NVPROF? (Stack Overflow)
