Accelerate Host <-> Device Memory Transfer Besides cudaMallocHost


Is there any way to further accelerate CUDA memory transfers between host and device, besides calling cudaMallocHost?


If your problem can be partitioned, you can hide memory transfer latency (to a certain degree) by performing transfers and kernel execution simultaneously. Take a look at the stream API example in the SDK.
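For anyone who wants to see the shape of this without digging through the SDK, here is a minimal sketch of the idea. It is not the SDK sample itself: the kernel `scale`, the helper `scaleOverlapped`, and the limit of 16 streams are all made up for illustration. It assumes the host buffer was allocated with cudaMallocHost (asynchronous copies from pageable memory will not overlap) and that the device can overlap copies with kernel execution.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: scales each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Splits n floats into nStreams chunks. Each chunk is copied to the device,
// processed, and copied back in its own stream, so the copy for one chunk
// can overlap with the kernel working on another.
// hostData is assumed to be pinned (allocated with cudaMallocHost).
// Returns true if every CUDA call succeeded.
bool scaleOverlapped(float *hostData, int n, float factor, int nStreams) {
    const int kMaxStreams = 16;  // arbitrary limit chosen for this sketch
    if (n <= 0 || nStreams < 1 || nStreams > kMaxStreams) return false;

    float *dev = nullptr;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) return false;

    cudaStream_t streams[kMaxStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    const int chunk = (n + nStreams - 1) / nStreams;  // elements per stream
    for (int s = 0; s < nStreams; ++s) {
        const int offset = s * chunk;
        if (offset >= n) break;
        const int count = (n - offset < chunk) ? (n - offset) : chunk;
        const size_t bytes = count * sizeof(float);

        // All three operations are queued in the same stream, so they run
        // in order relative to each other but may overlap other streams.
        cudaMemcpyAsync(dev + offset, hostData + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(dev + offset, count, factor);
        cudaMemcpyAsync(hostData + offset, dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams; this also surfaces errors from the queued work.
    const cudaError_t err = cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    return err == cudaSuccess;
}
```

Called as something like `scaleOverlapped(h, n, 2.0f, 4)` on a pinned buffer `h`. How much you gain depends on how the kernel time compares with the copy time, so the best number of chunks is worth measuring rather than guessing.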

Another option is to avoid transfers altogether by performing more work on the GPU.

Just to state every option: The motherboard has a LOT to do with transfer speed.

Use a Nehalem. Because of the improved CPU-side memcpy performance, memcpys from pageable memory are much, much faster than on previous chips.


Do Opterons do anything like this at all?

I haven’t tried the DDR3 Phenom IIs, so I’m not sure. However, the pageable bandwidth on Barcelona and the like doesn’t seem to be that high. It’s certainly not in the same ballpark as Nehalem, where you lose only 20% or so of pinned performance in optimal configurations.

Yeah, our 3.0 GHz Phenom II with DDR2 still sees a 50% drop in bandwidth going from pinned to pageable memory.

I forgot, did you test your Nehalem bandwidth with only 2 channels to see what the effect was?

Yeah, it went from 5.2 or so with DDR3 1600 to ~4.5 (at DDR3 1600) or ~3.8 (DDR3 1066).
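
If you want to check the pinned versus pageable gap on your own board, the SDK ships a bandwidthTest sample. A rough sketch of the same kind of measurement is below. The function names `h2dBandwidthGBs` and `comparePinnedPageable` are made up for this post, the repetition count is arbitrary, and only the host-to-device direction is timed.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Times `reps` host-to-device copies of `bytes` bytes using CUDA events.
// Returns the average bandwidth in GB/s, or -1 on failure.
float h2dBandwidthGBs(const void *host, void *dev, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the last copy has finished

    float ms = 0.0f;
    const cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;

    const double gigabytes = (double)bytes * reps / 1e9;
    return (float)(gigabytes / (ms / 1e3));
}

// Measures the same copy from a pinned buffer and from a pageable buffer.
// Returns true if both measurements succeeded.
bool comparePinnedPageable(size_t bytes, float *pinnedGBs, float *pageableGBs) {
    const int reps = 10;  // arbitrary; raise it for steadier numbers
    void *dev = nullptr, *pinned = nullptr;
    void *pageable = malloc(bytes);
    if (!pageable) return false;
    if (cudaMalloc(&dev, bytes) != cudaSuccess ||
        cudaMallocHost(&pinned, bytes) != cudaSuccess) {
        if (dev) cudaFree(dev);
        free(pageable);
        return false;
    }
    // Touch both buffers so the pages are actually backed by memory.
    memset(pageable, 1, bytes);
    memset(pinned, 1, bytes);

    *pinnedGBs = h2dBandwidthGBs(pinned, dev, bytes, reps);
    *pageableGBs = h2dBandwidthGBs(pageable, dev, bytes, reps);

    cudaFreeHost(pinned);
    cudaFree(dev);
    free(pageable);
    return *pinnedGBs > 0.0f && *pageableGBs > 0.0f;
}
```

Calling `comparePinnedPageable(64 << 20, &pinned, &pageable)` and printing the two values gives numbers you can compare against the ones quoted in this thread. Expect them to vary with transfer size, chipset, and memory configuration.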