I know this topic has been covered countless times - but (at least for me) this is such a big issue that I just have to bring it up again…
I have a bunch of kernels, all of which run inside a real-time processing loop (processing video frames) - so performance is critical.
ALL of these kernels spend more time sending memory to the GPU (cuMemcpyHtoD) than they do actually running (uploading generally takes 2-4 times longer than executing).
When profiling my kernels I start a timer (actually 2: my own performance timer and CUDA’s timing events - both are generally within 50us of each other) just before I start setting up the kernel parameters (cuParamSet*), and stop it after I synchronize following the launch (cuCtxSynchronize after cuLaunchGrid; the synchronize is there purely for profiling purposes).
Surprisingly, it’s worse if I use pinned memory with async memory transfers: the overhead of allocating the pinned memory, copying my already existing data over to it, and then freeing it afterwards is … tremendous.
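One thing I'm considering to get rid of the per-frame allocate/free cost is to allocate the pinned buffer once, outside the loop, and just reuse it every frame (growing it only when a frame needs more room). Below is a rough, untested-on-GPU sketch of that idea. StagingBuffer, host_alloc and host_free are names I made up for illustration; the malloc/free stand-ins are only there so the sketch builds without CUDA, and in the real loop I'd assume they wrap cuMemAllocHost / cuMemFreeHost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator hooks. In the real loop these would wrap
// cuMemAllocHost / cuMemFreeHost; malloc/free are stand-ins (assumption)
// so the sketch compiles without the CUDA toolkit.
using AllocFn = void* (*)(std::size_t);
using FreeFn  = void (*)(void*);

inline void* host_alloc(std::size_t bytes) { return std::malloc(bytes); }
inline void  host_free(void* p)            { std::free(p); }

// Owns one host staging buffer that lives across frames.
class StagingBuffer {
public:
    StagingBuffer(AllocFn a, FreeFn f) : alloc_(a), free_(f) {
        assert(a != nullptr && f != nullptr);
    }
    ~StagingBuffer() { if (data_) free_(data_); }

    StagingBuffer(const StagingBuffer&) = delete;
    StagingBuffer& operator=(const StagingBuffer&) = delete;

    // Returns a buffer of at least `bytes`. Reallocates only when the
    // request exceeds the current capacity, so steady-state frames pay
    // no allocation cost. Old contents are discarded on growth.
    void* reserve(std::size_t bytes) {
        if (bytes > capacity_) {
            if (data_) free_(data_);
            data_ = alloc_(bytes);
            capacity_ = data_ ? bytes : 0;
            ++allocations_;
        }
        return data_;
    }

    std::size_t capacity() const { return capacity_; }
    int allocations() const { return allocations_; }  // for profiling

private:
    AllocFn alloc_;
    FreeFn free_;
    void* data_ = nullptr;
    std::size_t capacity_ = 0;
    int allocations_ = 0;
};
```

The idea would be to create one of these at startup, call reserve() each frame, fill the returned memory and hand it to the async copy. That still leaves the host-side copy into the buffer, unless the vertex data can be generated directly into it.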
In most cases I have at least 2 KB of data to upload (and usually to download back to the host as well), generally consisting of 2- and 3-dimensional vertex data. In extreme cases I have 10 KB+ (normals/texture coordinates, 3x4/4x4 matrices, etc). This generally takes 250us-1000ms to upload from pageable memory, and a lot less from pinned memory (but freeing the pinned memory later has such a performance hit that it’s not worth it).
I’m currently working on minimizing how much I have to upload (e.g. only uploading vertices when they change) - but even with these optimizations, uploading just a few bytes of data takes 50-100us (even worse on laptops and other machines with bandwidth issues, in some cases 500us+ for 64 bytes) - compared to kernels which take anywhere between 50us and 400us.
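Since even a few bytes cost 50-100us, the cost looks like mostly fixed per-call overhead rather than bandwidth, so another thing I'm thinking of trying is packing all the small arrays for a kernel into one contiguous host buffer and uploading that with a single cuMemcpyHtoD, then passing offsets to the kernel. A minimal host-side sketch of the packing step follows. Chunk, align_up and pack_chunks are hypothetical names of mine, the 16-byte alignment is an assumption, and the std::vector staging buffer is just a placeholder for wherever the host buffer really lives.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One source array to be uploaded (hypothetical helper type).
struct Chunk {
    const void* src;
    std::size_t bytes;
};

// Rounds n up to a multiple of align (align must be a power of two).
inline std::size_t align_up(std::size_t n, std::size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Copies every chunk into `staging`, each starting at an aligned offset,
// and returns the byte offset of each chunk. The whole of `staging` can
// then be sent to the device in one copy instead of one copy per chunk.
std::vector<std::size_t> pack_chunks(const std::vector<Chunk>& chunks,
                                     std::vector<unsigned char>& staging,
                                     std::size_t align = 16) {
    assert(align != 0 && (align & (align - 1)) == 0);

    // First pass: compute offsets and the total padded size.
    std::vector<std::size_t> offsets;
    offsets.reserve(chunks.size());
    std::size_t total = 0;
    for (const Chunk& c : chunks) {
        offsets.push_back(total);
        total = align_up(total + c.bytes, align);
    }

    // Second pass: copy the data; padding bytes stay zero.
    staging.assign(total, 0);
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (chunks[i].bytes > 0) {
            std::memcpy(staging.data() + offsets[i],
                        chunks[i].src, chunks[i].bytes);
        }
    }
    return offsets;
}
```

For example, 6 floats of 2D vertices plus a 3x4 float matrix would end up at offsets 0 and 32 in an 80-byte buffer, and the device pointers are just the base device address plus those offsets. I haven't measured whether this actually wins yet.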
I just wanted to know whether everyone else is experiencing similar issues, or whether there are smarter ways of using pinned memory?