Host to Device memcpy overhead

Smokey · March 16, 2009, 12:36am

Hey all,

I know this topic has been covered countless times - but (at least for me) this is such a big issue that I just have to bring it up again…

I have a bunch of kernels, all of which run inside of a real time processing loop (processing video frames) - so performance is critical.
ALL of these kernels spend more time sending memory to the GPU (cuMemcpyHtoD) than they do actually running (the upload generally takes 2-4 times longer than the execution).

When profiling my kernels I start a timer (actually two: my own performance timer and CUDA’s timing events, which are generally within 50us of each other) just before I start setting up the kernel parameters (cuParamSet*), and stop it after I synchronize (for profiling purposes) following the kernel launch (cuCtxSynchronize after cuLaunchGrid).

Surprisingly, it’s worse if I use pinned memory w/ async memory transfers (the overhead involved in creating the pinned memory, copying my already existing memory over to it, and then deleting it afterwards is … tremendous).
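One way to avoid paying the allocate/copy/free cost every frame is to keep a single staging buffer alive for the whole processing loop and fill it in place. The following is a minimal host-side sketch of that idea, not a measured fix. `StagingBuffer` and its `acquire`/`allocations` members are hypothetical names, and the allocator and deallocator are passed in as callbacks so the sketch compiles without CUDA; in a driver-API program they would presumably wrap `cuMemAllocHost` and `cuMemFreeHost`.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical wrapper: keeps one staging buffer alive across frames and
// reallocates only when a request exceeds the current capacity.
class StagingBuffer {
public:
    using Alloc = std::function<void*(std::size_t)>;
    using Free  = std::function<void(void*)>;

    StagingBuffer(Alloc alloc, Free free)
        : alloc_(std::move(alloc)), free_(std::move(free)) {}
    ~StagingBuffer() { if (ptr_) free_(ptr_); }

    StagingBuffer(const StagingBuffer&) = delete;
    StagingBuffer& operator=(const StagingBuffer&) = delete;

    // Returns a buffer of at least `bytes`; allocates only when growing.
    void* acquire(std::size_t bytes) {
        if (bytes > capacity_) {
            if (ptr_) free_(ptr_);
            ptr_ = nullptr;          // stay consistent if alloc_ throws
            capacity_ = 0;
            ptr_ = alloc_(bytes);
            capacity_ = bytes;
            ++allocations_;
        }
        return ptr_;
    }

    // Number of times the allocator callback has been invoked.
    std::size_t allocations() const { return allocations_; }

private:
    Alloc alloc_;
    Free free_;
    void* ptr_ = nullptr;
    std::size_t capacity_ = 0;
    std::size_t allocations_ = 0;
};
```

If the vertex data is generated directly into the pointer returned by `acquire`, the extra copy from existing pageable memory into the pinned buffer goes away as well, and the buffer is freed once at shutdown rather than once per frame.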

In most cases I have at least 2 KB of data to upload (and usually to download back to the host as well), generally consisting of 2- and 3-dimensional vertex data. In extreme cases I have 10 KB+ (normals/texture coordinates, 3x4/4x4 matrices, etc). Uploading from pageable memory generally takes 250-1000us, and generally a lot less from pinned memory (but freeing it later has such a performance hit that it’s not worth it).

I’m currently working on minimizing how much I have to upload (e.g. only uploading vertices when they change) - but even with these optimizations, uploading just a few bytes of data takes 50-100us (even worse on laptops and other machines with bandwidth issues, in some cases 500us+ for 64 bytes) - compared to kernels which take anywhere between 50us and 400us.
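Since even a 64-byte upload appears to carry a fixed cost of tens of microseconds, one option is to pack all the small arrays a kernel needs into one contiguous block and upload that with a single copy call. Below is a minimal host-side sketch of the packing step. `PackedUpload`, its `add` method, and the 16-byte default alignment are assumptions made for illustration, not anything from the posts above.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: packs several small host arrays into one contiguous
// staging block so they can be uploaded with a single copy call.
struct PackedUpload {
    std::vector<unsigned char> bytes;   // contiguous staging data
    std::vector<std::size_t> offsets;   // byte offset of each packed block

    // Appends `size` bytes from `src`, aligned to `align` bytes (assumed 16).
    // Returns the byte offset of the block inside `bytes`.
    std::size_t add(const void* src, std::size_t size, std::size_t align = 16) {
        std::size_t offset = (bytes.size() + align - 1) / align * align;
        bytes.resize(offset + size);    // padding bytes are zero-filled
        if (size > 0) std::memcpy(bytes.data() + offset, src, size);
        offsets.push_back(offset);
        return offset;
    }
};
```

After packing, a single `cuMemcpyHtoD(devBase, packed.bytes.data(), packed.bytes.size())` would replace the per-array copies, and each kernel parameter becomes `devBase + offset` (here `devBase` is a hypothetical device allocation large enough for the whole block). This only helps if the per-call overhead really dominates, which the timings above suggest but do not prove.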

I just wanted to know if everyone else is experiencing similar issues, or whether there are smarter ways of using pinned memory?

Smokey · March 16, 2009, 12:59am

I should probably note that the reason pinned memory doesn’t really give me the kind of performance increase most people see is that I don’t actually have any work to do ‘while’ it’s uploading - so once I get to cuLaunchGrid, it just waits for the copies to finish anyway - essentially giving the same speed as pageable copies.

computerulz · March 17, 2009, 12:53am

I think more than anything your problem is the extremely small size of the data chunks you are using (KB). With mine I copy a few MB, and when I switch between pinned and not pinned I see an obvious increase in transfer speed, of about 40%. However, since I also had the allocation and freeing of the pinned memory in the method and did that on every call, the overall execution time of the method was essentially identical for the amount of memory I use. So I think for pinned memory to show any sort of benefit, the amount of data would have to be greater than 10 MB. My situation doesn’t require pinned memory in the end, as the relative time of memcpy/execution is about 0.2%, but it was interesting to test.
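A rough way to see why transfer size matters so much is a linear cost model: each copy pays a fixed per-call overhead plus a size-dependent term. The sketch below assumes that model. `CopyModel` and its members are hypothetical names, and the numbers plugged in afterwards are illustrative assumptions, not measurements from either setup in this thread.

```cpp
#include <cstddef>

// Simple linear cost model (an assumption, not a measurement):
//   time = fixed per-call overhead + bytes / sustained bandwidth
struct CopyModel {
    double overhead_us;    // fixed cost per copy call, in microseconds
    double bytes_per_us;   // sustained bandwidth, in bytes per microsecond

    // Estimated duration of one copy of `bytes` bytes.
    double time_us(std::size_t bytes) const {
        return overhead_us + static_cast<double>(bytes) / bytes_per_us;
    }

    // Fraction of the estimated time spent on fixed overhead (0..1).
    double overhead_fraction(std::size_t bytes) const {
        return overhead_us / time_us(bytes);
    }
};
```

Taking the 50us reported above for tiny uploads as the overhead, and assuming roughly 3 GB/s of bandwidth (`CopyModel m{50.0, 3000.0}`), the model puts about 99% of a 2 KB copy into fixed overhead but under 2% of a 10 MB copy. Under those assumptions a faster transfer path can barely help at the KB scale, while cutting the number of copy calls can.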
