Page-locked memory

This is somewhat off-topic as I’m interested in page-locked memory in general, but it’s relevant to CUDA, so here goes:

1 - Can someone give an explanation of the mechanism by which page-locking is achieved?

I’m guessing this has to do with mlock(), which probably sets the VM_LOCKED flag? I realize that I don’t have to do that myself and that the nvidia API takes care of it for me, but I’m just curious.

2 - Why are H-to-D and D-to-H transfers faster when using page-locked memory?

It seems obvious that in a loaded system page-locking is going to be faster (because the pages you need might be swapped out), but it seems to be faster even with very little load. Is there something in the Linux kernel that makes page-locked transfers faster or is there something in nvidia’s API that makes it faster (or both)?

3 - Can normal applications that are CPU-only (and may or may not use some bus device, if that matters let me know if you don’t mind) achieve performance benefits (not characteristic benefits, such as security) from using page-locked memory in a system where almost no paging is going on to begin with? Again, obviously it could help if the machine was heavily loaded…

If there are links/articles/etc. I can read that clear up some of this, I’d really appreciate it. Also, if DMA is involved in any of this and somebody knows where I can read about DMA systems that are in use, that would be nice too, especially as it relates to CUDA programming, though general material is fine as well.


There is a fair amount of Rocket Science™ under the hood to make DMA transfers happen. In addition to page-locking the memory so the OS will not move it around, any DMA peripheral (not just our GPUs) must set up scatter/gather DMA hardware to access the memory. The driver also must ensure that in the event of process exit, the DMA setup will be torn down before the pages are given back to the operating system (a guarantee that is easier to make on some platforms than others).

DMA transfers are faster because the hardware can drive the PCI Express protocols directly, much faster than (say) having the CPU initiate write-combined memory transactions. Also, the hardware can do these transfers concurrently with CPU execution.

In CUDA, all memory transfers are done via DMA; the pageable ones are staged into a pair of private DMA areas (so the driver can ping-pong CPU copies to/from one while the hardware is transferring to/from the other). This is slower than the direct transfer because it involves lots of CPU time as well as an extra copy.
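A minimal sketch of the two paths described above, using the standard CUDA runtime API (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes = 64 << 20;          /* 64 MiB */
    float *d;
    cudaMalloc(&d, bytes);

    /* Pageable path: the driver stages the copy through its private
       pinned buffers (CPU memcpy + DMA, ping-ponged in chunks). */
    float *pageable = (float *)malloc(bytes);
    cudaMemcpy(d, pageable, bytes, cudaMemcpyHostToDevice);

    /* Page-locked path: the hardware DMAs directly from this
       allocation, with no intermediate CPU copy. */
    float *pinned;
    cudaMallocHost(&pinned, bytes);         /* page-locked host memory */
    cudaMemcpy(d, pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d);
    return 0;
}
```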

In general the operating systems do a good job of keeping things resident that need to be resident. If no paging is going on, everything must be resident anyway so page-locking will not give a performance benefit. The best practice is to keep memory pageable unless you have some compelling reason to lock it down.

Still a bit confused…

I think this is where you’re explaining the reasons for the performance difference between pageable and page-locked memory.

Could you perhaps explain this part in more detail or point me to somewhere where I can read about this?



In a nutshell the difference is:

A mem-transfer from pageable memory is:

  • a mem-copy from pageable to non-pageable memory
  • DMA from non-pageable memory to the device

A mem-transfer from non-pageable memory is:

  • DMA from non-pageable memory to the device

The primary difference is whether the CPU is involved in the data transfer. So why does overlapping data transfer with kernel execution require cudaMallocHost to get page-locked memory? What I mean is: even if the CPU participates in the data transfer, that shouldn’t influence the GPU computation. Why must overlapping data transfer with GPU computation use page-locked memory?

I am very confused about this.

Look forward to someone’s reply!

Because there’s no way for the GPU (or any device) to do DMA from pageable memory: the CPU could invalidate or relocate the pages while the device is in the middle of a DMA transfer.
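To make the overlap concrete, here is a minimal sketch using a stream and cudaMemcpyAsync, which requires the host buffer to be page-locked (error checking omitted):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  /* page-locked: required for
                                               truly asynchronous copies */
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    /* These calls return immediately; the copies and the kernel execute
       in stream order on the device, overlapping with the CPU. */
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    work<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

If the host buffer were pageable, cudaMemcpyAsync would fall back to the staged, CPU-assisted transfer described earlier, and the overlap with GPU computation would be lost.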

Is the amount of page-locked memory equal to the amount of physical memory in the machine?

No. You won’t be able to allocate that much because of the kernel’s memory region, other apps that need to run occasionally, etc.

(and this is even more confusing on Vista)


Sorry, what is the kernel’s memory region? How can I find out how large it is?

Also, can page-locked memory be used for non-CUDA computation, such as SSE? Would that give any benefit?