Host to Device memcpy overhead

Hey all,

I know this topic has been covered countless times - but (at least for me) this is such a big issue I just have to bring it up again…

I have a bunch of kernels, all of which run inside a real-time processing loop (processing video frames) - so performance is critical.
ALL of my kernels spend more time sending memory to the GPU (cuMemcpyHtoD) than they do actually executing - generally by a factor of 2-4x longer uploading than executing.

When profiling my kernels I start a timer (actually two: my own performance timer plus CUDA's timing events - the two generally agree to within 50us) just before I start setting up the kernel parameters (cuParamSet*), and stop it after I synchronize (for profiling purposes only) following the launch (cuCtxSynchronize after cuLaunchGrid).
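For reference, this is roughly what the timed region looks like, stripped down (error checking removed; myFunc, d_data, h_data and DATA_BYTES are stand-ins for my actual function and buffers):

    #include <cuda.h>

    /* Roughly the region I'm timing (stand-in names, no error checks). */
    void timed_frame(CUfunction myFunc, CUdeviceptr d_data,
                     const void *h_data, size_t DATA_BYTES)
    {
        CUevent evStart, evStop;
        float ms;
        cuEventCreate(&evStart, CU_EVENT_DEFAULT);
        cuEventCreate(&evStop,  CU_EVENT_DEFAULT);

        cuEventRecord(evStart, 0);
        cuMemcpyHtoD(d_data, h_data, DATA_BYTES);   /* the upload */
        cuParamSetv(myFunc, 0, &d_data, sizeof(d_data));
        cuParamSetSize(myFunc, sizeof(d_data));
        cuFuncSetBlockShape(myFunc, 256, 1, 1);
        cuLaunchGrid(myFunc, 64, 1);
        cuCtxSynchronize();                         /* profiling only */
        cuEventRecord(evStop, 0);

        cuEventSynchronize(evStop);
        cuEventElapsedTime(&ms, evStart, evStop);   /* upload dominates this */

        cuEventDestroy(evStart);
        cuEventDestroy(evStop);
    }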

Surprisingly, it’s actually worse if I use pinned memory with async memory transfers - the overhead of allocating the pinned buffer, copying my already-existing data into it, and freeing it afterwards is tremendous.
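The per-frame pinned path I tried looks roughly like this (simplified, error checking stripped; h_src, d_data and DATA_BYTES are stand-ins) - it’s the allocate/free on every frame that kills it:

    #include <cuda.h>
    #include <string.h>

    /* Per-frame pinned staging, simplified (stand-in names). */
    void upload_pinned_per_frame(CUdeviceptr d_data, const void *h_src,
                                 size_t DATA_BYTES)
    {
        void *h_pinned;
        cuMemAllocHost(&h_pinned, DATA_BYTES);   /* slow: page-locks memory */
        memcpy(h_pinned, h_src, DATA_BYTES);     /* extra host-side copy */
        cuMemcpyHtoDAsync(d_data, h_pinned, DATA_BYTES, 0);
        cuCtxSynchronize();                      /* nothing to overlap with */
        cuMemFreeHost(h_pinned);                 /* slow: unpins and frees */
    }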

In most cases I have at least 2KB of data to upload (and, in most cases, download back to the host as well) - generally 2- and 3-dimensional vertex data, and in extreme cases 10KB+ (normals/texture coordinates, 3x4/4x4 matrices, etc.). From pageable memory this generally takes 250us-1000us to upload - and generally a lot less from pinned memory (but freeing the pinned buffer afterwards costs so much that it isn’t worth it).

I’m currently working on minimizing how much I have to upload (e.g. only uploading vertices when they actually change) - but even with these optimizations, uploading just a few bytes of data takes 50-100us (worse on laptops and other bandwidth-constrained machines, in some cases 500us+ for 64 bytes) - compared to kernels which take anywhere between 50us and 400us.
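The change-tracking itself is nothing fancy - essentially a dirty flag per buffer, something like this sketch (VertexBuffer and its fields are made up for illustration, not my real code):

    #include <cuda.h>

    /* Dirty-flag sketch: only upload buffers whose host copy changed. */
    typedef struct {
        CUdeviceptr d_ptr;   /* device copy */
        const void *h_ptr;   /* host copy */
        size_t      bytes;
        int         dirty;   /* set by whatever mutates h_ptr */
    } VertexBuffer;

    static void upload_if_dirty(VertexBuffer *vb)
    {
        if (vb->dirty) {
            cuMemcpyHtoD(vb->d_ptr, vb->h_ptr, vb->bytes);
            vb->dirty = 0;
        }
    }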

I just wanted to know whether everyone else is experiencing similar issues, or whether anyone knows smarter ways of using pinned memory?

I should probably note that the reason pinned memory doesn’t give me the kind of performance increase most people see is that I don’t actually have any work to do ‘while’ it’s uploading - so once I get to cuLaunchGrid, the launch just waits for the copies to finish anyway - giving essentially the same speed as pageable copies.
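The only way I can see pinned + async paying off for a workload like this is overlapping across frames rather than within one - uploading frame N+1 while frame N’s kernel runs - assuming the device supports copy/kernel overlap. A sketch of that double-buffered idea (all names are stand-ins, error checking stripped):

    #include <cuda.h>

    /* Double-buffered sketch: frame i's kernel (stream i&1) overlaps
       frame i+1's upload (stream (i+1)&1). f, d_buf, h_pinned, bytes
       and the launch shape are all stand-ins. */
    void process_frames(CUfunction f, CUdeviceptr d_buf[2],
                        void *h_pinned[2], size_t bytes, int nFrames)
    {
        CUstream s[2];
        cuStreamCreate(&s[0], 0);
        cuStreamCreate(&s[1], 0);

        for (int i = 0; i < nFrames; ++i) {
            int b = i & 1;
            cuStreamSynchronize(s[b]);  /* frame i-2 is fully done with buffer b */
            /* ... write frame i's data into h_pinned[b] here ... */
            cuMemcpyHtoDAsync(d_buf[b], h_pinned[b], bytes, s[b]);
            cuParamSetv(f, 0, &d_buf[b], sizeof(CUdeviceptr));
            cuParamSetSize(f, sizeof(CUdeviceptr));
            cuFuncSetBlockShape(f, 256, 1, 1);
            cuLaunchGridAsync(f, 64, 1, s[b]);
        }
        cuCtxSynchronize();
        cuStreamDestroy(s[0]);
        cuStreamDestroy(s[1]);
    }

That only helps if frame N+1’s data is actually available before frame N finishes, which isn’t the case for me right now.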

I think more than anything your problem is the extremely small size of the data chunks you are using (a few KB). I copy a few MB, and when I switch between pinned and non-pinned memory I see an obvious increase in transfer speed - about 40%. However, since I also had the allocation and freeing of the pinned memory inside the method, and did that on every call, the overall execution time of the method was essentially identical for the amount of memory I use.

So I think for pinned memory to show any real benefit, the amount of data would have to be greater than 10MB. My situation doesn’t require pinned memory in the end, since the memcpy is only about 0.2% of the execution time relative to the kernel, but it was interesting to test.
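For what it’s worth, the comparison I ran was essentially this (error checking stripped; adjust the size to whatever you want to test):

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Rough pinned-vs-pageable HtoD comparison (no error checks). */
    static float time_htod(CUdeviceptr dst, void *src, size_t bytes)
    {
        CUevent a, b; float ms;
        cuEventCreate(&a, CU_EVENT_DEFAULT);
        cuEventCreate(&b, CU_EVENT_DEFAULT);
        cuEventRecord(a, 0);
        cuMemcpyHtoD(dst, src, bytes);
        cuEventRecord(b, 0);
        cuEventSynchronize(b);
        cuEventElapsedTime(&ms, a, b);
        cuEventDestroy(a); cuEventDestroy(b);
        return ms;
    }

    int main(void)
    {
        size_t bytes = 8u << 20;   /* 8 MB test block */
        CUdevice dev; CUcontext ctx; CUdeviceptr d;
        void *pageable = malloc(bytes), *pinned;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuMemAlloc(&d, bytes);
        cuMemAllocHost(&pinned, bytes);

        printf("pageable: %.3f ms\n", time_htod(d, pageable, bytes));
        printf("pinned:   %.3f ms\n", time_htod(d, pinned, bytes));

        cuMemFreeHost(pinned); cuMemFree(d);
        free(pageable); cuCtxDestroy(ctx);
        return 0;
    }

Once I timed the alloc/free of the pinned buffer as well, the ~40% transfer gap disappeared for my sizes.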