Is it possible to measure how long it takes to copy host memory to GPU memory? I looked at 2.1.2 Using CUDA GPU Timers in the best practices guide, but that seems to be a host-only timer, which I believe is only useful for wrapping around a kernel launch. Is there anything else I can try?
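The standard copy operations (cudaMemcpy) are blocking with respect to the host, so you can simply wrap the call in a host timer. Alternatively, there is the CUDA events API (cudaEventRecord, cudaEventElapsedTime), which uses device-side timers. Either will work fine. If you switch to cudaMemcpyAsync, the call can return before the copy has finished, so synchronize before stopping a host timer, or use events.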
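Here is a minimal sketch of both approaches, in case it helps. The 64 MiB buffer size, the CHECK macro, and the variable names are my own choices for illustration, not anything from the guide, and the numbers you get will depend on your hardware and on whether the host buffer is pageable or pinned.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    // Buffer size is an arbitrary assumption (64 MiB).
    const size_t bytes = static_cast<size_t>(64) << 20;

    // Pageable host buffer and a device buffer of the same size.
    char *h_buf = static_cast<char *>(std::malloc(bytes));
    if (h_buf == nullptr) return EXIT_FAILURE;
    std::memset(h_buf, 1, bytes);
    char *d_buf = nullptr;
    CHECK(cudaMalloc(reinterpret_cast<void **>(&d_buf), bytes));

    // Warm-up copy so context creation is not included in the timings.
    CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));

    // Approach 1: host timer around the blocking copy.
    // The explicit synchronize makes sure the device is idle before
    // the timer is stopped.
    auto t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    double host_ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Approach 2: CUDA events recorded in the default stream.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));  // wait until 'stop' has been reached
    float event_ms = 0.0f;
    CHECK(cudaEventElapsedTime(&event_ms, start, stop));

    std::printf("host timer:  %.3f ms\n", host_ms);
    std::printf("event timer: %.3f ms\n", event_ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_buf));
    std::free(h_buf);
    return 0;
}
```

This should build with nvcc as a single .cu file. The two timings should come out close to each other; averaging over several copies gives more stable numbers than a single measurement.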