I’m trying to use pinned memory to improve the speed of a little algorithm, but I always get the error “unspecified launch failure in prior launch.” So I tried the simpleZeroCopy example in the SDK, and it returned “Device 0 cannot map host memory!”, which happens when the device does not have the canMapHostMemory property. Why can’t I map host memory?
I’m using a GeForce 9500 GT on Ubuntu 9.10 (standard x86 version, not 64-bit), with toolkit 2.2 and kernel module version 185.18.08. Can somebody give me some insight into the situation?
I’m not sure what the trouble you are having with pinned memory is, but mapping host memory into the device memory space is only supported on the GT200 series and some of the integrated motherboard GPUs. The 9500 cannot map host memory.
Can you explain more about the original pinned memory problem? That should be fixable.
I was trying to adapt an already fast algorithm to CUDA (as a first and quick introduction for my training course), but according to my measurements, the device->host cudaMemcpy was the slowest part of the program (which was making the CUDA-enabled program slower than the original). Having read on the forum that using pinned memory (which is the zero copy program, if I’m not mistaken?) could improve performance, I wanted to give it a try.
I suspected that there could be a hardware requirement, but could not find any information. Now I’m sure. Thank you for this quick answer!
Pinned memory and zero-copy (i.e. mapping host memory into the device memory space) are different things. Pinned memory blocks are regions of host memory which the OS cannot move. This allows the CUDA driver to take a shortcut when transferring data between the CUDA device and the pinned memory. On many systems, this results in double the effective host-to-device and device-to-host bandwidth. Pinned memory blocks are created by allocating them with cudaMallocHost() instead of the usual malloc(). Pinned memory (or “page-locked memory” as it is also called) can be created on any CUDA system, as it does not depend on the device at all.
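To make the difference concrete, here is a minimal sketch of what switching your transfer buffer to pinned memory could look like. The buffer names and size are placeholders, not your actual code; the only real change from a malloc()-based version is the cudaMallocHost()/cudaFreeHost() pair:

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 1 << 20;                 // 1M floats (hypothetical size)
    float *h_pinned = NULL, *d_buf = NULL;

    // Allocate page-locked (pinned) host memory instead of malloc().
    // The OS will not page this region out, so the driver can DMA
    // directly from it -- that is where the bandwidth gain comes from.
    if (cudaMallocHost((void **)&h_pinned, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) h_pinned[i] = (float)i;

    // The cudaMemcpy calls themselves are unchanged; the speedup comes
    // purely from the host buffer being pinned.
    cudaMemcpy(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_pinned, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);                   // note: not free()!
    return 0;
}
```

This works on any CUDA device, including your 9500 GT, since pinning does not require canMapHostMemory.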
“Zero-copy” actually maps a block of host memory into the memory space of the device. The memory controller on the device transparently issues DMA requests over the PCI-Express bus to copy data over to the device (or from the device) on demand. This means that you can much more easily transfer data between the device and the host as your kernel is running, removing the need for an explicit cudaMemcpy before or after the kernel executes. Mapping host memory into the device space requires several commands. If you are curious about the details, check out this whitepaper:
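For reference, the sequence of commands looks roughly like this sketch (the kernel and buffer names are made up for illustration). Note the canMapHostMemory check up front, which is exactly the one simpleZeroCopy fails on your card:

```cpp
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // reads/writes cross PCI-E on demand
}

int main(void)
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {            // fails on a 9500 GT
        fprintf(stderr, "Device 0 cannot map host memory!\n");
        return 1;
    }

    const int n = 1024;
    float *h_data, *d_data;

    // Pinned AND mapped: cudaHostAllocMapped makes the block visible
    // to the device, unlike plain cudaMallocHost().
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // No cudaMemcpy at all: the kernel touches host memory directly.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaThreadSynchronize();                 // toolkit 2.2-era sync call

    cudaFreeHost(h_data);
    return 0;
}
```

After the synchronize, h_data holds the kernel's results without any explicit copy.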
I understand now; I (hastily) assumed those were the same thing and went straight for zero copy. I’ll re-read the reference manual and examples keeping that in mind. I’ll go for pinned memory first, but I also have a laptop with an MCP79 board, so I’ll try zero copy after that.