I have a big chunk of memory (8GB) reserved at boot time by passing “mem=4G memmap=8G$4G” to the kernel. I use this memory to capture data from a bespoke PCIe card. I ioremap this memory into kernel virtual space, and also map it into user virtual space via a simple char device driver. From user space I transfer the mmapped data blocks onto the Tesla C1060 for processing, and I notice that this transfer is quite slow. What can I do about that?
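For context, the mmap handler in my char driver follows the usual remap_pfn_range pattern, roughly like the sketch below (the names RESERVED_PHYS, RESERVED_SIZE, and mydev_mmap are illustrative placeholders, not my actual driver):

```c
/* Sketch of a char-device mmap handler for a boot-reserved region.
 * RESERVED_PHYS/RESERVED_SIZE are assumptions matching memmap=8G$4G. */
#include <linux/fs.h>
#include <linux/mm.h>

#define RESERVED_PHYS 0x100000000ULL  /* 4G: start of the reserved region */
#define RESERVED_SIZE 0x200000000ULL  /* 8G: size of the reserved region */

static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
{
        unsigned long len = vma->vm_end - vma->vm_start;

        if (len > RESERVED_SIZE)
                return -EINVAL;

        /* Hand the reserved physical pages straight to the caller.
         * Note: an uncached mapping here (pgprot_noncached) makes CPU
         * reads from the buffer very slow, which may matter for the
         * observed transfer speed. */
        return remap_pfn_range(vma, vma->vm_start,
                               (RESERVED_PHYS >> PAGE_SHIFT) + vma->vm_pgoff,
                               len, vma->vm_page_prot);
}
```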
I guess that cudaMallocHost makes use of the kernel's ability to allocate DMA-able RAM, i.e. RAM in contiguous pages that the DMA subsystem can access. I would like to tell CUDA: cudaUseAsPinnedMemory(addr, size); which would treat my region as DMA-able memory. Given this magic call, the data transfer to the Tesla would presumably be much quicker than a transfer from malloced (or mmapped) memory.
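In use, the call I am wishing for would look something like this. To be clear, cudaUseAsPinnedMemory() does not exist and the device node name is made up; cudaMalloc and cudaMemcpy are the real runtime API:

```cpp
// Hypothetical sketch: cudaUseAsPinnedMemory() is the wished-for call,
// not a real CUDA function. "/dev/capture0" is an assumed device name.
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(void)
{
    size_t size = 64 << 20;                  // one 64 MB capture block
    int fd = open("/dev/capture0", O_RDWR);  // char device over the reserved RAM
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // The magic call: tell CUDA this region is already physically
    // contiguous and DMA-able, so the copy can skip the staging buffer.
    // cudaUseAsPinnedMemory(buf, size);

    void *dev;
    cudaMalloc(&dev, size);
    cudaMemcpy(dev, buf, size, cudaMemcpyHostToDevice);  // currently slow
    return 0;
}
```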
This is of course only a workaround for the ideal solution: transferring data from the bespoke PCIe card directly onto the Tesla, perhaps not even involving temporary copies through the CPU, but a true peer-to-peer copy.