CPU-to-GPU data transfer query

My understanding of CPU-to-GPU transfer is as follows: if the data is in pageable memory and is not actually resident in RAM, the OS creates a copy of the data in a pinned region, which is then transferred to device memory. I have three questions:

  1. If the pageable memory buffer is already resident in RAM, does the OS simply lock the page in place?
  2. If the pageable memory buffer has been paged out to secondary storage, why can't the transfer use GPUDirect and skip a copy?
  3. How does performance differ when allocating a pageable memory buffer versus a pinned memory buffer?

Thanks!

The proper mental model here is that CUDA always copies pageable memory into a pinned staging buffer before transferring to the device.

CUDA has no knowledge of whether a given pageable address is actually paged out or not.
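One place the staging is visible in practice: `cudaMemcpyAsync` from pageable memory cannot return until the runtime has copied the data into its internal pinned staging buffer, whereas from pinned memory the DMA engine reads the buffer directly and the call returns immediately. A minimal sketch of the two cases (sizes arbitrary, error checking omitted):

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;  // 64 MiB, arbitrary

    float *pageable = (float *)malloc(bytes);  // ordinary pageable allocation
    float *pinned = nullptr;
    cudaMallocHost(&pinned, bytes);            // page-locked (pinned) allocation

    float *dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // From pageable memory: the runtime stages through an internal pinned
    // buffer, so this "async" call does not return until the data has been
    // copied out of the pageable buffer.
    cudaMemcpyAsync(dev, pageable, bytes, cudaMemcpyHostToDevice, stream);

    // From pinned memory: the DMA engine can read the buffer directly, so
    // the call returns immediately and the copy can overlap host work.
    cudaMemcpyAsync(dev, pinned, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);

    cudaFree(dev);
    cudaFreeHost(pinned);
    free(pageable);
    cudaStreamDestroy(stream);
    return 0;
}
```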

  1. No. Locking a page in place for a one-off transfer is typically slower than copying it, so the runtime stages through its pinned buffer instead. (CUDA does let you pin an existing allocation in place with cudaHostRegister, which pays off when the buffer is reused; see the sketch after this list.)
  2. Guessing here: there would be no use case for it. Anyone using GPUDirect has a workstation or an embedded GPU and typically runs a highly optimized custom application. Having too little RAM and relying on the operating system to swap would be very atypical. In those situations the application developer would rather choose a solution with more control, e.g. have the application manage the swapping instead of the operating system.
  3. Are you asking about the performance of allocating, or of copying after having allocated? Copying pinned memory is faster. Allocation cost depends: has the application allocated a pool beforehand? Does the operating system have to free or swap something out to satisfy the new pinned allocation? Either way, allocation should be moved outside performance-critical parts of the application, e.g. outside of loops. The sketch below times the copy side of this.
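To make items 1 and 3 concrete, here is a rough timing sketch. It pins an existing malloc'd buffer in place with cudaHostRegister (the "lock the page" case from item 1) and compares host-to-device copy times from pageable, registered, and cudaMallocHost memory using CUDA events. The helper name `copyMs` and the buffer size are mine, chosen just for illustration:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Time one host-to-device copy of `bytes` from `src` using CUDA events.
static float copyMs(void *dev, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dev, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 256 << 20;  // 256 MiB, arbitrary

    void *dev = nullptr;
    cudaMalloc(&dev, bytes);

    void *pageable = malloc(bytes);
    memset(pageable, 1, bytes);  // fault the pages in before timing
    printf("pageable copy:            %.2f ms\n", copyMs(dev, pageable, bytes));

    // Item 1: CUDA can lock an existing allocation in place, but the
    // registration itself is expensive, so it only pays off when the
    // buffer is reused across many transfers.
    cudaHostRegister(pageable, bytes, cudaHostRegisterDefault);
    printf("pinned-in-place copy:     %.2f ms\n", copyMs(dev, pageable, bytes));
    cudaHostUnregister(pageable);
    free(pageable);

    // Item 3: a buffer that was allocated pinned from the start.
    void *pinned = nullptr;
    cudaMallocHost(&pinned, bytes);
    printf("cudaMallocHost copy:      %.2f ms\n", copyMs(dev, pinned, bytes));
    cudaFreeHost(pinned);

    cudaFree(dev);
    return 0;
}
```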

First of all, GDS expects a filesystem interface (to wit: cuFile); a chunk of data paged out to disk by the host OS into an opaque paging buffer is nothing like that. Second, GDS has specific requirements on the storage software stack, plus system topology requirements, none of which are satisfied in the general case where a cudaMemcpy call may take place.
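For reference, the cuFile path looks roughly like this: the application opens a file on a GDS-supported filesystem and asks cuFile to read it straight into device memory. A hedged sketch (the file path is a placeholder and error handling is omitted); note how everything revolves around an explicit file handle and offsets, which is exactly the interface an OS paging buffer does not provide:

```cpp
#define _GNU_SOURCE 1  // for O_DIRECT on Linux
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary

    cuFileDriverOpen();  // initialize the GDS driver

    // The file must live on a filesystem supported by GDS;
    // "/mnt/nvme/data.bin" is a placeholder path.
    int fd = open("/mnt/nvme/data.bin", O_RDONLY | O_DIRECT);

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    void *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cuFileBufRegister(dev, bytes, 0);  // optional: register the GPU buffer

    // DMA directly from storage into device memory, bypassing any host
    // bounce buffer -- a file I/O call, not something cudaMemcpy on a
    // pageable address could fall back to.
    ssize_t n = cuFileRead(handle, dev, bytes, /*file_offset=*/0,
                           /*devPtr_offset=*/0);
    printf("read %zd bytes\n", n);

    cuFileBufDeregister(dev);
    cudaFree(dev);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```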