I was hoping someone could clarify how mapped (zero-copy) memory transfers are queued on Fermi devices. Based on the streams and concurrency webinar, I understand that Fermi has three stream queues: host-to-device (H2D) memory transfer, compute engine, and device-to-host (D2H) memory transfer. As I understand it, only the host issues stream operations, so my specific questions are:
[list=1]
[*] What happens when a kernel reads or writes mapped memory? Specifically, do mapped memory operations end up in an H2D or D2H queue, or is the data transferred some other (magical?) way?
[*] Are mapped memory D2H transfers blocked until all scheduled kernels (issued in different streams) have finished executing?
[*] Can mapped memory be used to run multiple kernels concurrently so that D2H transfers do not have to wait until all scheduled kernels have completed?
[/list]
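
To make the scenario concrete, here is a minimal sketch of the kind of setup I have in mind. The kernel name `fillMapped`, the `CHECK` macro, the buffer size, and the two-stream layout are all made up for illustration; the runtime calls (`cudaSetDeviceFlags`, `cudaHostAlloc` with `cudaHostAllocMapped`, `cudaHostGetDevicePointer`) are the standard way to set up mapped memory, but I am only assuming this is the relevant pattern for the questions above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: each thread writes its result directly into mapped
// host memory through the device-side alias pointer. No explicit
// cudaMemcpy (and so no explicit D2H stream operation) is issued.
__global__ void fillMapped(int *out, int n, int base)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = base + i;
    }
}

int main()
{
    const int n = 1 << 20;
    const int numStreams = 2;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Enable mapped pinned allocations; must happen before the context
    // is created on this device.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    int *hostBuf[numStreams];   // host-side pointers to the pinned buffers
    int *devAlias[numStreams];  // device-side aliases of the same buffers
    cudaStream_t streams[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaHostAlloc((void **)&hostBuf[s], n * sizeof(int),
                            cudaHostAllocMapped));
        CHECK(cudaHostGetDevicePointer((void **)&devAlias[s], hostBuf[s], 0));
    }

    // One kernel per stream, each writing into its own mapped buffer.
    for (int s = 0; s < numStreams; ++s) {
        fillMapped<<<blocks, threads, 0, streams[s]>>>(devAlias[s], n, s * 1000);
        CHECK(cudaGetLastError());
    }

    // Wait only for stream 0, then read its buffer on the host while the
    // kernel in stream 1 may still be running.
    CHECK(cudaStreamSynchronize(streams[0]));
    printf("stream 0, first element: %d\n", hostBuf[0][0]);

    CHECK(cudaDeviceSynchronize());
    printf("stream 1, first element: %d\n", hostBuf[1][0]);

    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaFreeHost(hostBuf[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return 0;
}
```

In this sketch the host never issues a copy, so what I am trying to understand is where the writes to `out[i]` are queued, and whether the results from stream 0 can reach the host before the kernel in stream 1 has finished.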