I found a potential anomaly in the results returned by the CUDA SDK application bandwidthTest. This is running on my 8-core Mac Pro (previous generation, not Nehalem) with an 8800 GT card. The host-to-device bandwidth is 10 times slower than device-to-host! Am I missing something? I thought PCIe bandwidth would be symmetric, or that if anything the read-back would be the slower direction, not the other way around.
I am trying out the CUDA 2.3 SDK on OS X 10.5.7 with the 2.3 CUDA driver.
Pinned memory is a chunk of host memory which has been marked by the CUDA driver as “unmovable” to the operating system. The OS is not allowed to relocate the memory (virtual address translation makes this possible without invalidating pointers) or swap the memory block to disk.
This is important because memory transfer between the host and the CUDA device is done with a DMA transaction. This requires the memory block on the host to have a fixed physical address. If you do not use pinned memory, CUDA has to stage the transfer. For a device-to-host copy, it DMA transfers a block of data (possibly smaller than your request) to a private pinned memory location inside the driver, then copies that data into your non-pinned memory block. This process repeats until your entire requested memory transfer is complete. (A host-to-device copy runs the same two steps in the opposite order.) The overhead of two copies makes non-pinned (aka “pageable memory”) transfers much slower than pinned memory transfers on many systems, usually about half speed. (The one exception is the triple-channel Core i7 systems, which have so much memory bandwidth that you barely notice the difference between pinned and pageable memory at all.)
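To make the two-stage copy concrete, here is a rough sketch that does the same bouncing by hand for a device-to-host transfer. It only illustrates the idea and is not how the driver is actually implemented: the helper name `staged_copy_to_host`, the 1 MiB chunk size and the 0x5A fill pattern are all made up for the example.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical helper: copy `bytes` from device memory into a pageable host
// buffer by bouncing each chunk through a pinned staging buffer.
void staged_copy_to_host(void *dst_pageable, const void *src_device,
                         size_t bytes, void *staging_pinned, size_t chunk)
{
    char *dst = static_cast<char *>(dst_pageable);
    const char *src = static_cast<const char *>(src_device);
    for (size_t off = 0; off < bytes; off += chunk) {
        size_t n = (bytes - off < chunk) ? (bytes - off) : chunk;
        // Copy 1: device -> pinned staging buffer (the DMA step).
        CHECK(cudaMemcpy(staging_pinned, src + off, n, cudaMemcpyDeviceToHost));
        // Copy 2: pinned staging buffer -> pageable destination (plain CPU copy).
        memcpy(dst + off, staging_pinned, n);
    }
}

int main()
{
    const size_t bytes = 8u << 20;  // 8 MiB payload (arbitrary)
    const size_t chunk = 1u << 20;  // 1 MiB staging buffer (arbitrary)

    void *d_buf = NULL;
    void *staging = NULL;
    CHECK(cudaMalloc(&d_buf, bytes));
    CHECK(cudaMemset(d_buf, 0x5A, bytes));   // known pattern on the device
    CHECK(cudaMallocHost(&staging, chunk));  // page-locked host memory

    unsigned char *h_buf = static_cast<unsigned char *>(malloc(bytes));  // pageable
    if (h_buf == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }

    staged_copy_to_host(h_buf, d_buf, bytes, staging, chunk);

    // Every byte should carry the pattern written on the device.
    size_t bad = 0;
    for (size_t i = 0; i < bytes; ++i)
        if (h_buf[i] != 0x5A) ++bad;
    assert(bad == 0);
    printf("mismatched bytes: %lu\n", (unsigned long)bad);

    free(h_buf);
    CHECK(cudaFreeHost(staging));
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Each chunk gets moved twice (once by DMA, once by the CPU), which is where the extra cost of pageable transfers comes from.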
Your extremely poor host-to-device bandwidth in the pageable case suggests there is something very wrong with the two-stage copy process on your system. I don’t know what would cause that. However, if you can use pinned memory as a workaround until you can figure that out, you’ll be fine. (And your memory transfers will be faster on other systems as well.)
The CUDA programming guide has more info on pinned (also called “page-locked”) memory.
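If you want to try the pinned-memory workaround in your own code, something along these lines should do it. Treat it as a minimal sketch rather than a proper benchmark: the helper name `h2d_bandwidth_gbs` and the 32 MiB buffer size are my own choices, it times a single copy per case, and the numbers will vary from system to system.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical helper: time one host-to-device copy and return GB/s.
double h2d_bandwidth_gbs(void *d_dst, const void *h_src, size_t bytes)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    return ((double)bytes / 1.0e9) / ((double)ms / 1.0e3);
}

int main()
{
    const size_t bytes = 32u << 20;  // 32 MiB (arbitrary test size)

    void *d_buf = NULL;
    CHECK(cudaMalloc(&d_buf, bytes));

    // Pageable host memory: ordinary malloc.
    void *h_pageable = malloc(bytes);
    if (h_pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }

    // Pinned (page-locked) host memory: allocated through the CUDA runtime.
    void *h_pinned = NULL;
    CHECK(cudaMallocHost(&h_pinned, bytes));

    // Touch both buffers so their pages are resident before timing.
    memset(h_pageable, 1, bytes);
    memset(h_pinned, 1, bytes);

    // Untimed warm-up copy so one-time setup cost is not measured.
    CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));

    printf("pageable host-to-device: %.2f GB/s\n",
           h2d_bandwidth_gbs(d_buf, h_pageable, bytes));
    printf("pinned   host-to-device: %.2f GB/s\n",
           h2d_bandwidth_gbs(d_buf, h_pinned, bytes));

    free(h_pageable);
    CHECK(cudaFreeHost(h_pinned));  // pinned memory must go back via cudaFreeHost
    CHECK(cudaFree(d_buf));
    return 0;
}
```

The only change your application needs is swapping `malloc`/`free` for `cudaMallocHost`/`cudaFreeHost` on the buffers you transfer. As far as I know, bandwidthTest itself can also be run with `--memory=pinned` to get the pinned numbers for comparison.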
I’m getting the same thing on my Mac Pro with OS X 10.5.7, CUDA 2.3 and the 2.3 driver. This seems to be a bug, since my MacBook Pro (8600M) with CUDA 2.1 gets 10x better performance for host-to-device transfers from pageable memory.