How to force CUDA to use DMA for memcpy


I am running samples/1_Utilities/bandwidthTest. I see great performance for DeviceToHost and HostToDevice operations.

However, if the host side of the copy is the MMIO space of some other device instead of RAM, performance drops dramatically. I also observe that the copy is then done by the CPU rather than by DMA, which is what makes the performance so bad. Is it possible to force CUDA to still use DMA instead of the CPU for that copy?


The basic OS driver model generally prevents PCI device A from writing directly to a buffer owned by PCI device B unless the drivers involved provide special support for it.

If you want to transfer data directly to/from a PCI device that is on the same PCI fabric as a GPU, then the defined method for that is GPUDirect RDMA:

• RDMA for GPUDirect Doc page
• GDRCopy GitHub project

This assumes you have access to the driver source code for your device and are a reasonably proficient driver writer for the OS in question.

Unless you’ve done that, CUDA cannot write directly to your device. Instead it will write to system memory, and if that memory is not pinned (page-locked), the maximum transfer speed cannot be achieved.

txbob, I noticed GDRCopy claims it can be faster than cudaMemcpy. Have you tried this yourself? I tried the sample benchmarks and it performed worse.

edit: sorry, I forgot I asked here, so I posted some results in another thread.

And which thread did you post to?