Hi,
I’m developing a driver for a PCI Express device we’re designing. The device will DMA data into RAM, and an application then has to transfer that data to the GPU. Right now the memory has to be allocated in the driver, and I intended to map it into user space so the application could pass it to a host-to-device memcpy.

As a first test, I allocated a block of contiguous memory in my driver, created an MDL for it, and mapped the memory into user space. Then I memcpy’d it to the GPU and measured the bandwidth of the transfer. I got about 2700 MB/s. The problem is, I need the 4500-5500 MB/s that can be achieved with pinned memory from cudaHostAlloc…

I’m now trying to reproduce that same kind of memory in my driver. Is that feasible? What does cudaHostAlloc do differently? The memory created in my driver was contiguous and came from non-paged pool, so why is the transfer so “slow”?
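For reference, my bandwidth measurement is structured roughly like this (a minimal sketch of the pinned reference case using cudaHostAlloc; the buffer size and variable names are illustrative, and in my driver test the host pointer comes from the user-space mapping instead):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t size = 256 * 1024 * 1024;  // illustrative 256 MiB test block

    // Pinned (page-locked) host buffer; in the driver test this pointer
    // would instead be the user-space mapping of the driver's MDL.
    void* host_ptr = nullptr;
    cudaHostAlloc(&host_ptr, size, cudaHostAllocDefault);

    void* dev_ptr = nullptr;
    cudaMalloc(&dev_ptr, size);

    // Time the host-to-device copy with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev_ptr, host_ptr, size, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.1f MB/s\n",
           (size / (1024.0 * 1024.0)) / (ms / 1000.0));

    cudaFree(dev_ptr);
    cudaFreeHost(host_ptr);
    return 0;
}
```

With the cudaHostAlloc buffer this reports the higher numbers; with the driver-mapped buffer the same copy lands around 2700 MB/s.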
Thanks!!