mapping vs. copying CPU memory

Hi!

The OpenCL documentation says “The function clEnqueueMapBuffer(…) enqueues a command to map a region of the buffer object given by buffer into the host address space and returns a pointer to this mapped region”. My interpretation of this statement is that the CPU configures its own memory management unit, the PCI bus controller, and the GPU’s memory management unit such that an access through this pointer goes over the PCI bus into the corresponding region of GPU memory. The file “oclBandwidthTest.cpp” included in Nvidia’s OpenCL SDK samples seems to confirm this interpretation: in mapped mode it measures the time of a repeatedly performed “memcpy” to compute the CPU-GPU bandwidth, which only makes sense if one of the pointers actually refers to GPU memory.
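
For reference, the pattern I mean looks roughly like this (just a sketch; the names queue, buffer, hostDst and size are placeholders of mine, and error checking is omitted). It mirrors what oclBandwidthTest.cpp does in “mapped” access mode:

    #include <CL/cl.h>
    #include <string.h>

    // Read a device buffer by mapping it and doing a memcpy from the mapped pointer.
    void read_via_map(cl_command_queue queue, cl_mem buffer, void* hostDst, size_t size)
    {
        cl_int err;
        void* mapped = clEnqueueMapBuffer(queue, buffer, CL_TRUE /* blocking map */,
                                          CL_MAP_READ, 0, size, 0, NULL, NULL, &err);

        // If the buffer were really mapped, this memcpy would go over the PCI bus;
        // if the runtime copies on map, it is just a host-to-host copy.
        memcpy(hostDst, mapped, size);

        err = clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL);
        clFinish(queue);
    }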

However, I made some observations which do not support this interpretation (a sketch of the kind of timing I did follows the list):
*) clEnqueueMapBuffer takes approximately the same time as copying the entire buffer between host and GPU memory
*) clEnqueueUnmapMemObject takes as long as clEnqueueMapBuffer, unless the mapping was configured to be “read only”, in which case the unmapping is much faster and almost independent of the size of the mapped region
*) the performance of “memcpy” is the same for every combination of host pointer and mapped device pointer (I would expect it to be faster if both pointers actually refer to host RAM)
*) a buffer cannot be used by an OpenCL kernel while it is mapped (the program fails silently)
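
The timing was roughly along these lines (again just a sketch with placeholder names and no error checks; a real measurement would of course average over many repetitions):

    #include <CL/cl.h>
    #include <chrono>
    #include <cstdio>

    // Time a blocking map, the corresponding unmap, and an explicit read of the same buffer.
    void compare_map_vs_read(cl_command_queue queue, cl_mem buffer, void* hostDst, size_t size)
    {
        using clk = std::chrono::steady_clock;

        auto t0 = clk::now();
        void* mapped = clEnqueueMapBuffer(queue, buffer, CL_TRUE, CL_MAP_READ,
                                          0, size, 0, NULL, NULL, NULL);
        auto t1 = clk::now();                                   // map

        clEnqueueUnmapMemObject(queue, buffer, mapped, 0, NULL, NULL);
        clFinish(queue);
        auto t2 = clk::now();                                   // unmap

        clEnqueueReadBuffer(queue, buffer, CL_TRUE, 0, size, hostDst, 0, NULL, NULL);
        auto t3 = clk::now();                                   // explicit copy for comparison

        std::chrono::duration<double> map = t1 - t0, unmap = t2 - t1, read = t3 - t2;
        std::printf("map %.4f s, unmap %.4f s, read %.4f s\n",
                    map.count(), unmap.count(), read.count());
    }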

So it looks like GPU memory is actually not mapped into the CPU’s address space, but copied to CPU memory by the “map buffer” operation and copied back by “unmap buffer”. The specification is not clear about this (it allows caching of host memory specified with the CL_MEM_USE_HOST_PTR flag, but that’s a different story). Can somebody please clarify this?

Thanks & kind regards,
Markus

Let me put the question differently:

Is it possible to map GPU memory into the CPU’s address space, and do current Nvidia hardware and drivers support this?

Thanks & kind regards,
Markus

With CUDA you can only do the reverse and map CPU memory into the GPU’s address space. It generally requires a GTX 200 or later desktop card (the GeForce 9400M being an exception).
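
In CUDA terms this is the “zero-copy” pinned memory path; roughly like this (just a sketch, names and sizes are placeholders, error checks omitted):

    #include <cuda_runtime.h>

    // Map pinned host memory into the GPU's address space (the opposite direction of the
    // question above). Requires a device with canMapHostMemory.
    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);          // must precede context creation

        float* hostPtr = NULL;
        cudaHostAlloc((void**)&hostPtr, 1024 * sizeof(float), cudaHostAllocMapped);

        float* devPtr = NULL;                           // device-side alias of the same memory
        cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

        // ... a kernel dereferencing devPtr reads/writes host RAM over the PCI bus ...

        cudaFreeHost(hostPtr);
        return 0;
    }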

No idea if this is exposed to OpenCL.

If you look at the memory map of a Tesla card, you will see that its 4 GB of RAM is NOT exposed via PCI memory space.
Only the control registers are exposed, or the hardware uses some banking scheme to write to different regions of PCI memory space…

So when you copy data to the GPU, the driver (most likely) copies this data first to a pinned memory region and then asks the card to DMA it.
Maybe it even uses a double buffer to overlap the PCI DMA with the CPU memcpy…
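
To illustrate the idea in application-level CUDA terms (purely a sketch of what the driver might be doing internally; names are made up, error checks omitted):

    #include <cuda_runtime.h>
    #include <string.h>

    // Upload a pageable host buffer through a pinned double buffer so that the CPU memcpy
    // into one half overlaps the DMA of the other half.
    void staged_upload(void* devDst, const char* pageableSrc, size_t total, size_t chunk)
    {
        char* pinned[2];
        cudaStream_t stream[2];
        for (int i = 0; i < 2; ++i) {
            cudaMallocHost((void**)&pinned[i], chunk);          // pinned staging buffers
            cudaStreamCreate(&stream[i]);
        }

        for (size_t off = 0, i = 0; off < total; off += chunk, ++i) {
            int b = i & 1;                                      // which half of the double buffer
            size_t n = (total - off < chunk) ? (total - off) : chunk;

            cudaStreamSynchronize(stream[b]);                   // wait for this half's previous DMA
            memcpy(pinned[b], pageableSrc + off, n);            // CPU copy into pinned staging area
            cudaMemcpyAsync((char*)devDst + off, pinned[b], n,
                            cudaMemcpyHostToDevice, stream[b]); // DMA overlaps the next memcpy
        }

        for (int i = 0; i < 2; ++i) {
            cudaStreamSynchronize(stream[i]);
            cudaStreamDestroy(stream[i]);
            cudaFreeHost(pinned[i]);
        }
    }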

Hope this helps,

I vaguely remember Tim Murray of Nvidia saying that the hardware does not support mapping GPU memory into the host address space, but can’t find the thread at the moment.