With discrete GPUs, there is no getting around copying the data across PCI Express. But, you can avoid the overhead of a memcpy call by using mapped pinned memory: instead of calling cudaMalloc() to allocate device memory that you copy to/from, call cudaHostAlloc() with the cudaHostAllocMapped flag. This passes back pinned host memory that you can access with the CPU, but that also has been mapped into the CUDA address space. Call cudaHostGetDevicePointer() to get the device pointer of that pinned memory.
The kernel can write to the device pointer corresponding to the mapped pinned memory - this is actually a preferred mode of operation, since the GPU is just posting writes and there is no latency to cover - just remember to do CPU/GPU synchronization before reading the results written by the kernel. cudaDeviceSynchronize() is the big hammer (waits until the GPU is done processing), you need to use CUDA events for finer-grained synchronization.
Just as a warning, don’t overdo it on the pinned memory; pinned memory cannot be paged out, which can bad if you allocate too much of it (since your system can’t page it out, it’ll have less overall memory to work with, which in turn can increase your page fault rate).
You’d be surprised how much pinned memory you can allocate before it starts to noticeably drag on system performance.
I was: a couple years ago, I downloaded a CPU benchmark (this was on Windows - I think it was Futuremark) and ran it as a baseline, then created a CUDA program that performed a pinned allocation of a specified size and wait for a keypress before exiting.
That way, I could pin variable amount of memory and rerun the benchmark, watching for performance degradation.
The benchmark didn’t start to exhibit slower performance until 50% of physical RAM had been pinned. If I recall correctly, the machine got pretty sluggish (but still useable) at 75%. I was surprised the OS was letting me pin that much, to be honest with you; no application should try to pin that much memory anyway.
It was long enough ago that I’d have to re-do the test in order to report the results more formally. But it’s not a hard test to undertake yourself.
Obviously, YMMV on the specific machine and the workloads you are running concurrently with your CUDA app, but I have not felt the slightest twinge of guilt bout allocating pinned memory since doing that study.
If the system has 96GB of memory and Windows 7 can run happily on 2GB then wouldn’t you expect to be able to pin 94GB without it affecting general system performance? A lot of the problems that GPUs are being used to tackle require a lot of memory and with the introduction of the unified address space and improved cache on Fermi I would have thought that you might well want to pin this much memory.
In fact, something else I’ve been wondering is whether nVidia might make it possible to allow some device memory to be used as a hardware managed cache. I think it would already be possible to implement a fully associative cache inside a kernel (I once considered doing this with shared memory on previous generation hardware) but I don’t think it would perform very well and a more complex n-way cache would probably be quite tricky to do in software.