Page-Locked Host Memory without using cudaHostAlloc()

Does anybody have experience with using page-locked memory on Windows that is not allocated through one of the CUDA APIs, e.g. memory obtained via the VirtualAlloc()/VirtualLock() mechanism? A rough sketch of what I have in mind follows the two questions below.

  • Is it possible to copy data from host memory to device memory and vice versa?
  • Is it as fast as using host memory that is allocated with cudaHostAlloc()?
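To make the questions more concrete, here is a rough sketch of what I have in mind. It assumes a CUDA version that provides cudaHostRegister() for registering externally allocated memory; I have not measured whether this reaches the same transfer rates as cudaHostAlloc().

```cpp
// Sketch: allocate host memory with VirtualAlloc() and register it with CUDA.
// cudaHostRegister() pins the pages itself, so a separate VirtualLock() call
// should not be required for the CUDA side.
#include <windows.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t size = 16 * 1024 * 1024;  // 16 MiB staging buffer

    // Commit page-aligned memory outside the CUDA allocator.
    void* hostPtr = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!hostPtr) { printf("VirtualAlloc failed\n"); return 1; }

    // Register the range so cudaMemcpy() can use the page-locked DMA path.
    cudaError_t err = cudaHostRegister(hostPtr, size, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        printf("cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    void* devPtr = NULL;
    cudaMalloc(&devPtr, size);

    // Copies in both directions work on the registered pointer.
    cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);
    cudaMemcpy(hostPtr, devPtr, size, cudaMemcpyDeviceToHost);

    cudaFree(devPtr);
    cudaHostUnregister(hostPtr);
    VirtualFree(hostPtr, 0, MEM_RELEASE);
    return 0;
}
```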

One of the limitations of using cudaHostAlloc() or VirtualAlloc() is that the memory is only accessible from within the allocating process. Additionally, it may be swapped out by the OS if the process isn't active.
Does anybody have experience with using page-locked host memory that is accessible across process boundaries, e.g. with the help of the Windows DDK (I think with something like IoAllocateIrp())?

Use Case:
Process A copies some data from somewhere to the page-locked host memory via DMA.
Then the computing thread of process B copies the data from the page-locked host memory to some computing device via cudaMemcpy().
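The only user-mode approach I can think of for this use case is a named, pagefile-backed file mapping that both processes map, and that process B then tries to register with CUDA before calling cudaMemcpy(). The section name "Local\\DmaStagingBuffer" below is made up for illustration, and I don't know whether the driver accepts a registration of a MapViewOfFile() view, so please treat this as a sketch to verify rather than a known-working solution:

```cpp
// Sketch of the process-B side: open the shared section created by process A,
// map it, try to pin the view with cudaHostRegister(), then copy to the device.
#include <windows.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t size = 16 * 1024 * 1024;  // must match the size used by process A

    // "Local\\DmaStagingBuffer" is a hypothetical section name used only for this sketch.
    HANDLE mapping = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE, "Local\\DmaStagingBuffer");
    if (!mapping) { printf("OpenFileMapping failed\n"); return 1; }

    void* hostPtr = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (!hostPtr) { printf("MapViewOfFile failed\n"); CloseHandle(mapping); return 1; }

    // Attempt to page-lock the shared view; if this fails, cudaMemcpy() still
    // works but falls back to the slower pageable-memory path.
    cudaError_t reg = cudaHostRegister(hostPtr, size, cudaHostRegisterDefault);
    if (reg != cudaSuccess)
        printf("cudaHostRegister on the shared view failed: %s\n", cudaGetErrorString(reg));

    void* devPtr = NULL;
    cudaMalloc(&devPtr, size);
    cudaMemcpy(devPtr, hostPtr, size, cudaMemcpyHostToDevice);  // data written by process A

    cudaFree(devPtr);
    if (reg == cudaSuccess) cudaHostUnregister(hostPtr);
    UnmapViewOfFile(hostPtr);
    CloseHandle(mapping);
    return 0;
}
```

Process A would create the section with CreateFileMappingA() (INVALID_HANDLE_VALUE as the file handle for a pagefile-backed mapping) and write its DMA data into its own view of the same section.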

This is what Mellanox is doing with their GPU Direct technology. I think you’ll need to be quite cosy with NVIDIA to get that level of access (at least for the moment).