Why, how and when to use page locked host memory

I am transferring many chunks of data between host and device and have read that page locked memory can be used to speed this transfer up.

Why is this?

And how and when do I declare/allocate and use page locked memory?

For example, I have on the host many arrays of several million data items and copy the data in chunks in a loop. How do I declare these arrays as page locked, and do I need to do this only once or in each iteration of the loop?

Page-locked (pinned) memory is memory which cannot be relocated or swapped out by the OS. Any data sent to the GPU has to be in page-locked memory - if you were copying an array to the device (via DMA) and the array was swapped out to disk mid-copy, Bad Things would happen.

You don't have to use page-locked memory explicitly - calls to cudaMemcpy automatically make use of internal page-locked buffers if necessary. However, this adds an extra copy to the process: pageable -> page-locked -> device. If your array is already page-locked, cudaMemcpy doesn't have to do this extra copy, speeding things up (typically by a factor of two, YMMV).

To manipulate page-locked memory, you use cudaMallocHost and cudaFreeHost.
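To answer the chunked-copy question concretely: you allocate the pinned array once, up front, then reuse it across every iteration of the copy loop. A minimal sketch (array sizes, variable names, and the chunk size are illustrative assumptions, not from your code):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N     = 1 << 22;   // e.g. ~4M floats on the host
    const size_t CHUNK = 1 << 20;   // copy in 1M-element chunks

    float *h_data;   // host array, page-locked
    float *d_chunk;  // device buffer holding one chunk at a time

    // Allocate page-locked host memory ONCE, before the loop -
    // not per iteration. Pinning and unpinning pages is expensive.
    cudaMallocHost(&h_data, N * sizeof(float));
    cudaMalloc(&d_chunk, CHUNK * sizeof(float));

    for (size_t i = 0; i < N; ++i)
        h_data[i] = (float)i;       // fill with sample data

    // Each transfer goes straight from the pinned array to the
    // device, skipping the internal staging copy.
    for (size_t off = 0; off < N; off += CHUNK) {
        cudaMemcpy(d_chunk, h_data + off, CHUNK * sizeof(float),
                   cudaMemcpyHostToDevice);
        // ... launch kernels operating on d_chunk here ...
    }

    cudaFree(d_chunk);
    cudaFreeHost(h_data);           // pair cudaMallocHost with cudaFreeHost, not free()
    return 0;
}
```

A further benefit: pinned host memory is also required if you want overlapped, asynchronous transfers via cudaMemcpyAsync with streams, which can hide copy latency behind kernel execution.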