PCIe DMA broadcast & CUDA

Does CUDA currently support broadcast DMA transfers? I’m seeing bottlenecks when transferring from host memory to multiple cards, and broadcast appears to be an option on some MCHs, such as the one in the 790i Ultra SLI. If not, is there a supported way to do GPU-to-GPU transfers over the PCIe bus? I’ve seen this done in OpenGL shader demos before.

Also, is there a way to share cudaMallocHost regions among devices? I realize that the CUDA runtime calls into the driver to set up the region, but couldn’t the runtime expose a way to reuse these regions?

The answer to both questions is: not currently, but we’re working on it.

See this thread:

Fast peer-to-peer GPU transfer is one of my most desired CUDA features, so I’m on your side!