The OpenCL documentation says: “The function clEnqueueMapBuffer(…) enqueues a command to map a region of the buffer object given by buffer into the host address space and returns a pointer to this mapped region.” My interpretation of this statement is that the CPU configures its own memory management unit, the PCI bus controller, and the GPU’s memory management unit so that an access through this pointer goes over the PCI bus into the corresponding region of GPU memory. The file “oclBandwidthTest.cpp” included in Nvidia’s OpenCL SDK samples seems to confirm this interpretation: in mapped mode, the time of a repeatedly performed “memcpy” is measured to compute the CPU-GPU bandwidth, which only makes sense if one of the pointers actually refers to GPU memory.
However, I have made several observations that do not support this interpretation:
*) clEnqueueMapBuffer takes approximately the same time as copying the entire buffer between host and GPU memory
*) clEnqueueUnmapMemObject takes as long as clEnqueueMapBuffer, unless the mapping was configured to be “read only”, in which case the unmapping is much faster and almost independent of the size of the mapped region
*) the performance of “memcpy” is the same for every combination of host pointer and mapped device pointer (I would expect it to be faster when both pointers refer to host RAM)
*) a mapped buffer cannot be used by an OpenCL kernel while it is mapped (the kernel silently fails)
So it looks like GPU memory is not actually mapped into the CPU’s address space; instead, the buffer contents are copied to host memory by the “map buffer” operation and copied back by “unmap buffer”. The specification is not explicit on this point (it does allow caching of host memory specified with the CL_MEM_USE_HOST_PTR flag, but that is a different mechanism). Can somebody please clarify this?
Thanks & kind regards,