I was hoping someone could clarify how mapped memory transfers are queued on Fermi devices. Based on the streams and concurrency webinar, I understand that Fermi has three stream queues: host-to-device memory transfer, compute engine, and device-to-host memory transfer. As I understand it, only the host issues stream operations, so my specific questions are as follows:
[list=1]
[*]What happens when a kernel uses mapped memory transfers? Specifically, how do mapped memory operations make it to an H2D or D2H queue, or is there some other (magical?) way the memory is transferred?
[*]Are mapped memory D2H transfers blocked until all scheduled kernels (issued in different streams) have finished executing?
[*]Can mapped memory be used to run multiple concurrent kernels so that D2H transfers are not blocked until all scheduled kernel operations have completed?