How are mapped memory transfers queued?

I was hoping someone could clarify how mapped memory transfers are queued on Fermi devices. Based on the streams and concurrency webinar, I understand that Fermi has three stream queues: host-to-device memory transfer, compute engine, and device-to-host memory transfer. It is my understanding that only the host issues stream operations; so, specifically, my questions are as follows:


What happens when a kernel uses mapped memory transfers? Specifically, how do mapped memory operations make it to an H2D or D2H queue, or is there some other (magical?) way the memory is transferred?

Are mapped memory D2H transfers blocked until all scheduled kernels (issued in different streams) have finished executing?

Can mapped memory be used to run multiple concurrent kernels so that D2H transfers do not have to wait until all scheduled kernel operations have completed?

What do you mean by a “mapped memory transfer”?

I’m referring to the mapping of a block of page-locked host memory into the address space of the device by passing the ‘cudaHostAllocMapped’ flag to cudaHostAlloc() (see section of the C programming guide). This is supposed to make it so that the kernel can directly access the page-locked memory without the host issuing any H2D/D2H copies.
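For anyone following along, here is a minimal sketch of the setup I mean. The kernel name `increment`, the helper `run_mapped_increment`, and the buffer size are made up for illustration; the runtime calls (cudaSetDeviceFlags, cudaHostAlloc, cudaHostGetDevicePointer) are the standard ones, and I am assuming a toolkit recent enough to have cudaDeviceSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads and writes host memory through the mapped pointer.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
int run_mapped_increment(int n)
{
    int *h_data = nullptr;   // host pointer to the page-locked buffer
    int *d_alias = nullptr;  // device-side alias of the same buffer

    // Request host-memory mapping. Older setups need this before the context
    // is created; the result is ignored here because newer ones enable it by default.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear any error left by the call above

    if (cudaHostAlloc((void **)&h_data, n * sizeof(int), cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    for (int i = 0; i < n; ++i) {
        h_data[i] = i;
    }

    // Get the device pointer that refers to the mapped host buffer.
    if (cudaHostGetDevicePointer((void **)&d_alias, h_data, 0) != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }

    // No cudaMemcpy in either direction: the kernel touches host memory directly.
    increment<<<(n + 255) / 256, 256>>>(d_alias, n);

    // The host must wait for the kernel before reading the results.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_data);
        return -1;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (h_data[i] != i + 1) {
            ++bad;
        }
    }
    cudaFreeHost(h_data);
    return bad;
}

int main()
{
    int bad = run_mapped_increment(1024);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Note that the host never issues an explicit H2D or D2H copy here, which is why I am unsure where (or whether) these accesses get queued.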

In that case, the transfer doesn’t involve the hardware queues at all. Instead, the memory controller on the GPU issues PCI-Express DMA reads directly as kernels request memory from a mapped address. From the device, reading mapped memory basically looks like a very high-latency read from global memory.
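Since each access to a mapped address may turn into a trip across the bus, one common pattern is to read each mapped element once into a register and do the rest of the work from there. The sketch below illustrates that idea under my own assumptions: the names `square_plus_self` and `run_read_once` are hypothetical, the input lives in mapped host memory, and the output goes to ordinary device memory that is copied back explicitly.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: one read from the mapped input per thread, then the
// value is reused from a register instead of being fetched again.
__global__ void square_plus_self(const int *mapped_in, int *dev_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = mapped_in[i];    // single high-latency read of mapped memory
        dev_out[i] = x * x + x;  // reuse the register copy, write to device memory
    }
}

// Hypothetical helper: returns the number of wrong elements, or -1 on error.
int run_read_once(int n)
{
    int *h_in = nullptr, *d_in_alias = nullptr, *d_out = nullptr;
    int *h_out = (int *)malloc(n * sizeof(int));
    if (h_out == nullptr) {
        return -1;
    }

    cudaSetDeviceFlags(cudaDeviceMapHost);  // result ignored; see comment in lead-in
    cudaGetLastError();                     // clear any error from the call above

    int bad = -1;
    if (cudaHostAlloc((void **)&h_in, n * sizeof(int), cudaHostAllocMapped) == cudaSuccess) {
        for (int i = 0; i < n; ++i) {
            h_in[i] = i;
        }
        if (cudaHostGetDevicePointer((void **)&d_in_alias, h_in, 0) == cudaSuccess &&
            cudaMalloc((void **)&d_out, n * sizeof(int)) == cudaSuccess) {
            square_plus_self<<<(n + 255) / 256, 256>>>(d_in_alias, d_out, n);
            // This blocking copy also waits for the kernel to finish.
            if (cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess) {
                bad = 0;
                for (int i = 0; i < n; ++i) {
                    if (h_out[i] != i * i + i) {
                        ++bad;
                    }
                }
            }
            cudaFree(d_out);
        }
        cudaFreeHost(h_in);
    }
    free(h_out);
    return bad;
}

int main()
{
    int bad = run_read_once(1024);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The compiler may well hoist the repeated load on its own in a case this simple; the explicit local variable just makes the intent visible.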