DMA for CUDA: transferring data to CUDA device memory, bypassing the CPU

I cannot find any information about anything like DMA for CUDA. I want to transfer huge amounts of data to CUDA device memory from my own connection card, which is connected to the computer via a PCI-Express 2.0 x16 slot. I'm looking for a solution in which the transfer bypasses the CPU, so the whole transfer must be managed by the graphics card or by my own connection card.
Has anyone heard of something like that? Any links or docs?

Currently, the best you can do:

  1. DMA from your card to buffer X in system memory.
  2. Copy from buffer X to another buffer Y in system memory. Buffer Y was allocated as "pinned" (page-locked) memory through the CUDA API, which speeds up host-to-device transfers.
  3. Copy from buffer Y to CUDA device memory.
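The three steps above can be sketched roughly as follows. This assumes your card's driver fills buffer X; error checking is abbreviated, and the buffer size is a placeholder:

```c
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const size_t N = 1 << 20;  /* 1 MiB payload, for illustration */

    /* Buffer X: ordinary pageable system memory that the card DMAs into. */
    char *x = malloc(N);
    memset(x, 0xAB, N);  /* stand-in for step 1, data written by the card */

    /* Buffer Y: pinned (page-locked) host memory allocated via the CUDA API. */
    char *y;
    cudaMallocHost((void **)&y, N);

    /* Device-side destination buffer. */
    char *d;
    cudaMalloc((void **)&d, N);

    /* Step 2: host-to-host copy from pageable X into pinned Y. */
    memcpy(y, x, N);

    /* Step 3: fast host-to-device transfer from the pinned buffer. */
    cudaMemcpy(d, y, N, cudaMemcpyHostToDevice);

    cudaFree(d);
    cudaFreeHost(y);
    free(x);
    return 0;
}
```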

A recent webinar suggests that NVIDIA will improve this procedure in the near future by eliminating the need for step 2. Presumably, this would let you use the same pinned buffer in system memory both as the target of your card's DMA and as the source of CUDA host-to-device transfers.
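If that improvement materializes, the flow might look like the sketch below. This is speculative: it assumes an API that can page-lock an existing allocation after the fact (CUDA's `cudaHostRegister` has this shape), so the card DMAs into buffer X and CUDA copies straight from it, with no intermediate buffer Y:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t N = 1 << 20;

    /* Buffer X: the page-aligned region the connection card DMAs into. */
    void *x;
    posix_memalign(&x, 4096, N);

    /* Pin X in place so CUDA can DMA from it directly (hypothetical for
       the scenario in the answer; cudaHostRegister works like this). */
    cudaHostRegister(x, N, cudaHostRegisterDefault);

    void *d;
    cudaMalloc(&d, N);

    /* ... card DMAs new data into x ... */

    /* Single copy: pinned host buffer straight to device memory,
       with no staging through a second host buffer. */
    cudaMemcpy(d, x, N, cudaMemcpyHostToDevice);

    cudaHostUnregister(x);
    cudaFree(d);
    free(x);
    return 0;
}
```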

No solution exists for doing this directly over the PCI-Express bus between two cards, though NVIDIA employees are aware of the demand for this capability. They have given no timeline for direct card-to-card DMA, nor even acknowledged that such work is underway.

Fermi has new DMA hardware; we know it allows multiple simultaneous DMA transfers, which previous GPUs did not. NVIDIA hasn't said whether this will extend to card-to-card DMA, but the fact that the DMA engine was redesigned is a promising sign.