How are mapped memory transfers queued?

kleboeuf · June 7, 2012, 1:01pm

I was hoping someone could clarify how mapped memory transfers are queued on Fermi devices. Based on the streams and concurrency webinar, I understand that Fermi has three stream queues: host-to-device memory transfer, compute engine, and device-to-host memory transfer. It is my understanding that only the host issues stream operations; so, specifically, my questions are as follows:

1. What happens when a kernel uses mapped memory transfers? Specifically, how do mapped memory operations make it to an H2D or D2H queue, or is there some other (magical?) way memory is transferred?

2. Are mapped memory D2H transfers blocked until all scheduled kernels (issued in different streams) have finished executing?

3. Can mapped memory be used to run multiple concurrent kernels so that D2H transfers are not blocked until all scheduled kernel operations have completed?

seibert · June 7, 2012, 1:54pm

What do you mean by a “mapped memory transfer”?

kleboeuf · June 7, 2012, 2:12pm

I’m referring to the mapping of a block of page-locked host memory into the address space of the device by passing the ‘cudaHostAllocMapped’ flag to cudaHostAlloc() (see section 3.2.4.3 of the C programming guide). This is supposed to make it so that the kernel can directly access the page-locked memory without the host issuing any H2D/D2H copies.
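For anyone following along, here is a minimal sketch of the setup described above. The kernel name `scaleInPlace`, the helper `runMappedDemo`, and the array size are made up for illustration; only the `cuda*` runtime calls are real API. It assumes a device that supports mapped page-locked memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place. `data` points at
// page-locked host memory that is mapped into the device address space.
__global__ void scaleInPlace(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Hypothetical helper: returns 0 if the host sees the kernel's results.
int runMappedDemo()
{
    const int n = 1024;
    float *hostPtr = nullptr;
    float *devPtr = nullptr;

    // Request mapped pinned allocations. Older toolkits need this before the
    // context is created; the result is not treated as fatal here because the
    // call can fail when a context is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear any error left by the call above

    if (cudaHostAlloc((void **)&hostPtr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return 1;

    for (int i = 0; i < n; ++i)
        hostPtr[i] = (float)i;

    // Device-side alias of the same host allocation.
    bool ok = cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0) == cudaSuccess;

    if (ok) {
        // No cudaMemcpy anywhere: the kernel reads and writes host memory directly.
        scaleInPlace<<<(n + 255) / 256, 256>>>(devPtr, n);
        ok = cudaGetLastError() == cudaSuccess;
        // The host must wait for the kernel before reading the results.
        ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;
    }

    for (int i = 0; ok && i < n; ++i)
        if (hostPtr[i] != 2.0f * (float)i)
            ok = false;

    cudaFreeHost(hostPtr);
    return ok ? 0 : 1;
}

int main()
{
    int rc = runMappedDemo();
    printf("%s\n", rc == 0 ? "mapped memory demo: ok" : "mapped memory demo: failed");
    return rc;
}
```

The only host-side synchronization is the wait on the kernel itself; there is no explicit copy for the host to issue.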

seibert · June 7, 2012, 3:50pm

In that case, the transfer doesn’t involve the hardware queues at all. Instead, the memory controller on the GPU issues PCI-Express DMA reads directly as kernels request memory from a mapped address. Reading mapped memory from the device basically looks like a very high latency read to global memory.
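A rough sketch of what that looks like with two streams, assuming a device that supports mapped pinned memory. The names `addOne` and `runStreamDemo` and the sizes are hypothetical. Each stream only ever receives a kernel launch; no `cudaMemcpyAsync` is enqueued, so nothing is placed in a copy queue. Whether the two kernels actually overlap depends on the device and is not something this sketch demonstrates.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads from mapped host memory and writes the result
// straight back to mapped host memory.
__global__ void addOne(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + 1;
}

// Hypothetical helper: returns 0 if both halves arrive correctly on the host.
int runStreamDemo()
{
    const int n = 1 << 16;
    const int half = n / 2;
    const size_t bytes = n * sizeof(int);
    int *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;
    cudaStream_t streams[2];
    int created = 0;

    // Needed on older toolkits; may fail harmlessly if a context already exists.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    bool ok = cudaHostAlloc((void **)&hIn, bytes, cudaHostAllocMapped) == cudaSuccess
           && cudaHostAlloc((void **)&hOut, bytes, cudaHostAllocMapped) == cudaSuccess
           && cudaHostGetDevicePointer((void **)&dIn, hIn, 0) == cudaSuccess
           && cudaHostGetDevicePointer((void **)&dOut, hOut, 0) == cudaSuccess;

    while (ok && created < 2) {
        ok = cudaStreamCreate(&streams[created]) == cudaSuccess;
        if (ok) ++created;
    }

    if (ok) {
        for (int i = 0; i < n; ++i) { hIn[i] = i; hOut[i] = 0; }

        // One kernel per stream, each working on its own half of the buffers.
        for (int k = 0; k < 2; ++k)
            addOne<<<(half + 255) / 256, 256, 0, streams[k]>>>(
                dIn + k * half, dOut + k * half, half);
        ok = cudaGetLastError() == cudaSuccess;

        // Waiting on a stream is enough to make its kernel's writes visible to the host.
        for (int k = 0; k < 2; ++k)
            ok = (cudaStreamSynchronize(streams[k]) == cudaSuccess) && ok;

        for (int i = 0; ok && i < n; ++i)
            if (hOut[i] != i + 1)
                ok = false;
    }

    for (int k = 0; k < created; ++k)
        cudaStreamDestroy(streams[k]);
    if (hIn) cudaFreeHost(hIn);
    if (hOut) cudaFreeHost(hOut);
    return ok ? 0 : 1;
}

int main()
{
    int rc = runStreamDemo();
    printf("%s\n", rc == 0 ? "stream demo: ok" : "stream demo: failed");
    return rc;
}
```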