It’s true that you get good performance by using pinned (page-locked) host memory for host-to-device transfers. However, my data lives in pageable memory, so I still have to copy it into page-locked memory first.
If I use a function like “memcpy”, the copy is synchronous and costs 5-10 ms every frame.
And if I use a CPU worker thread to do the memcpy asynchronously, it still takes several milliseconds to synchronize with that thread.
cuMemcpyAsync can transfer from host to host, but it can only be used when unified addressing is enabled, which is not possible on Windows XP.
So am I wrong? Is there any other CUDA function or way to transfer from pageable host memory to page-locked host memory asynchronously?
Copying from pageable to pinned memory and from there to the device is exactly what the driver does for host->device copies from pageable memory, so you won’t gain any speedup by explicitly coding that yourself.
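For illustration, the path the driver takes internally looks roughly like the sketch below (simplified: the driver presumably chunks the copy through a pool of pinned staging buffers and overlaps the CPU memcpy with the DMA, rather than staging the whole transfer at once):

```c
#include <string.h>
#include <cuda_runtime.h>

/* Simplified sketch of the driver's internal staging path for a
 * host->device copy from pageable memory: copy into a pinned
 * staging buffer on the CPU, then DMA from there to the device. */
void staged_h2d(void *dev, const void *pageable, size_t size)
{
    void *staging;
    cudaMallocHost(&staging, size);         /* pinned staging buffer */
    memcpy(staging, pageable, size);        /* pageable -> pinned (CPU copy) */
    cudaMemcpy(dev, staging, size,
               cudaMemcpyHostToDevice);     /* pinned -> device (DMA) */
    cudaFreeHost(staging);
}
```

Coding this explicitly in your application just duplicates the work the driver already does, which is why it buys you nothing.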
What you can do with newer CUDA toolkits is to pin down pageable memory (without performing a memcpy) using cudaHostRegister() / cuMemHostRegister() before copying from there to the device.
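A minimal sketch of that approach (runtime API, CUDA 4.0+), assuming the buffer sizes are placeholders for your own frame data:

```c
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t size = 16 << 20;         /* example: a 16 MB frame */
    void *host = malloc(size);      /* the pageable buffer you already have */
    /* note: some platforms/toolkit versions require the pointer and
     * size to be page-aligned for registration */

    void *dev;
    cudaMalloc(&dev, size);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* pin the existing pages in place; no memcpy happens here */
    cudaHostRegister(host, size, cudaHostRegisterDefault);

    /* the transfer can now run asynchronously and overlap other work */
    cudaMemcpyAsync(dev, host, size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaHostUnregister(host);       /* unpin when done */
    cudaStreamDestroy(stream);
    cudaFree(dev);
    free(host);
    return 0;
}
```

Registration itself has some cost, so for per-frame buffers it pays to register once up front and unregister only at shutdown, rather than every frame.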
EDIT: It is reasonable to assume that the newest or future versions of the driver will take the cuMemHostRegister() path themselves, so explicitly coding a memcpy into your application could even slow things down under future drivers.
Any idea what happens on systems with dual Intel Xeon CPUs? In these configurations it’s common to have one PCIe x16 slot directly connected to each CPU, and of course each CPU has its own memory controller and memory, so presumably there is a performance benefit to having the pinned memory local to whichever CPU is connected to the GPU. What is the best way of making sure this happens?
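On Linux, one approach I could imagine is allocating the host buffer on a specific NUMA node with libnuma and then pinning it with cudaHostRegister(). This is just a sketch under those assumptions; node 0 below is a placeholder for whichever CPU is actually wired to the GPU’s PCIe slot:

```c
#include <string.h>
#include <numa.h>           /* link with -lnuma */
#include <cuda_runtime.h>

int main(void)
{
    size_t size = 64 << 20; /* example: 64 MB */
    int node = 0;           /* placeholder: the NUMA node local to the GPU */

    if (numa_available() < 0)
        return 1;           /* no NUMA support on this system */

    /* page-aligned allocation bound to the chosen node */
    void *buf = numa_alloc_onnode(size, node);
    memset(buf, 0, size);   /* touch the pages so they are faulted in */

    /* pin the node-local pages for fast DMA */
    cudaHostRegister(buf, size, cudaHostRegisterDefault);

    void *dev;
    cudaMalloc(&dev, size);
    cudaMemcpy(dev, buf, size, cudaMemcpyHostToDevice);

    cudaFree(dev);
    cudaHostUnregister(buf);
    numa_free(buf, size);
    return 0;
}
```

Whether cudaMallocHost() itself honors the calling thread’s NUMA policy is something I don’t know for sure, which is why the sketch allocates explicitly and registers afterwards.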