This is somewhat off-topic as I’m interested in page-locked memory in general, but it’s relevant to CUDA, so here goes:
1 - Can someone give an explanation of the mechanism by which page-locking is achieved?
I’m guessing this has to do with mlock(), which probably sets the VM_LOCKED flag? I realize that I don’t have to do that myself and that the NVIDIA API takes care of it for me, but I’m just curious.
2 - Why are H-to-D and D-to-H transfers faster when using page-locked memory?
It seems obvious that in a loaded system page-locking is going to be faster (because the pages you need might be swapped out), but it seems to be faster even with very little load. Is there something in the Linux kernel that makes page-locked transfers faster or is there something in nvidia’s API that makes it faster (or both)?
3 - Can normal CPU-only applications (which may or may not use some bus device; if that matters, please let me know) get performance benefits (as opposed to other benefits, such as security) from using page-locked memory in a system where almost no paging is going on to begin with? Again, obviously it could help if the machine was heavily loaded…
If there are links, articles, etc. that clear up some of this, I’d really like to read them. Also, if DMA is involved in some of this and somebody knows where I can read about the DMA systems that are in use, that would be nice too, especially as it relates to CUDA programming, but general material would be fine as well.
There is a fair amount of Rocket Science™ under the hood to make DMA transfers happen. In addition to page-locking the memory so the OS will not move it around, any DMA peripheral (not just our GPUs) must set up scatter/gather DMA hardware to access the memory. The driver also must ensure that in the event of process exit, the DMA setup will be torn down before the pages are given back to the operating system (a guarantee that is easier to make on some platforms than others).
DMA transfers are faster because the hardware can drive the PCI Express protocols directly, much faster than (say) having the CPU initiate write-combined memory transactions. Also, the hardware can do these transfers concurrently with CPU execution.
In CUDA, all memory transfers are done via DMA; the pageable ones are staged into a pair of private DMA areas (so the driver can ping-pong CPU copies to/from one while the hardware is transferring to/from the other). This is slower than the direct transfer because it involves lots of CPU time as well as an extra copy.
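To make the ping-pong staging concrete, here is a rough sketch in plain C. It is not the driver's code: staged_copy, fake_dma_to_device, and STAGE_BYTES are hypothetical names, the "DMA" is just a memcpy, and everything runs sequentially, whereas the real driver would let the hardware drain one staging buffer while the CPU fills the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGE_BYTES 4096  /* hypothetical staging buffer size */

/* Stand-in for the DMA engine (assumption: a plain memcpy, no real hardware). */
static void fake_dma_to_device(unsigned char *dev, const unsigned char *staging,
                               size_t n)
{
    memcpy(dev, staging, n);
}

/* Copy n bytes from pageable src to dev through two alternating staging
   buffers. Returns the number of chunks that were staged. */
size_t staged_copy(unsigned char *dev, const unsigned char *src, size_t n)
{
    /* In a real driver these two buffers would be page-locked DMA areas. */
    static unsigned char stage[2][STAGE_BYTES];
    size_t off = 0, chunks = 0;
    int cur = 0;

    assert(dev != NULL && src != NULL);
    while (off < n) {
        size_t len = (n - off < STAGE_BYTES) ? (n - off) : STAGE_BYTES;

        /* CPU copy into the staging buffer: the extra copy and CPU time
           that make pageable transfers slower. */
        memcpy(stage[cur], src + off, len);

        /* Hand the staged chunk to the "hardware". Here this is synchronous;
           in the scheme described above it would overlap with the next
           CPU copy into the other buffer. */
        fake_dma_to_device(dev + off, stage[cur], len);

        off += len;
        cur ^= 1;  /* switch to the other staging buffer */
        chunks++;
    }
    return chunks;
}
```

With page-locked source memory the loop disappears: the hardware can read the source buffer directly, so there is no CPU copy and no chunking.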
In general the operating systems do a good job of keeping things resident that need to be resident. If no paging is going on, everything must be resident anyway so page-locking will not give a performance benefit. The best practice is to keep memory pageable unless you have some compelling reason to lock it down.
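For anyone curious about the mlock() route mentioned in the question, here is a minimal user-space sketch. alloc_locked and free_locked are hypothetical helper names, and this only shows how a process asks the kernel to keep pages resident (on Linux this is what marks the mapping VM_LOCKED, as far as I know). It does not make the buffer usable for DMA; a driver has to pin the pages in the kernel and program the scatter/gather hardware, which is well beyond this sketch.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: allocate a page-aligned buffer of at least `bytes`
   bytes and try to lock it into RAM. Returns NULL only if the allocation
   fails; *locked is set to 1 if mlock() succeeded and 0 otherwise. */
void *alloc_locked(size_t bytes, int *locked)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* Round the request up to a whole number of pages. */
    size_t rounded = (bytes + (size_t)page - 1) / (size_t)page * (size_t)page;

    void *buf = NULL;
    if (posix_memalign(&buf, (size_t)page, rounded) != 0)
        return NULL;

    /* mlock() faults the pages in and keeps them resident. It can fail,
       for example when RLIMIT_MEMLOCK is too small, so the caller has to
       check *locked rather than assume the lock was granted. */
    *locked = (mlock(buf, rounded) == 0);
    return buf;
}

/* Hypothetical helper: undo alloc_locked(). */
void free_locked(void *buf, size_t bytes, int locked)
{
    if (locked)
        munlock(buf, bytes);  /* the kernel rounds the range to page boundaries */
    free(buf);
}
```

A caller would do something like `int locked; void *p = alloc_locked(1 << 20, &locked);` and fall back to ordinary pageable behavior when locked comes back 0. In line with the advice above, this is only worth doing when there is a compelling reason to keep the buffer resident.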
So the primary difference is whether or not the CPU has to be involved in the data transfer. But then why does overlapping data transfer with kernel execution require page-locked memory from cudaMallocHost? What I mean is: even if the CPU participates in the data transfer, that shouldn’t affect the GPU computation, so why must overlapping transfers with GPU computation use page-locked memory?
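Because there’s no way for the GPU (or any device) to safely do DMA from pageable memory: the OS could page out or relocate those pages while the device is in the middle of the transfer. Overlap needs a transfer the hardware can run entirely on its own, and a pageable copy still needs the CPU to stage the data first.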