The only search result related to this topic that I could find was this post, which wasn’t very definitive. IIRC, Mark said the ‘ball & bricks’ physics demo from GDC a couple of years ago was doing direct DMA from one GPU to another, but that was in either OGL or DX.

Any way to do this using CUDA, or do I have to burn 2x host memory bandwidth for it?
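
For reference, the staged copy I mean looks roughly like the sketch below: the data goes device 0 -> pinned host buffer -> device 1, so every byte is written to host memory once and read back once. This is only an illustration, not tested on any particular setup. The buffer names (`src`, `dst`, `staging`), the 64 MiB size, and the `CHECK` macro are made up for the example, and it assumes a CUDA runtime where a single host thread can switch between GPUs with `cudaSetDevice`; on a runtime that ties each device to its own host thread, the two hops would have to be issued from two threads.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, arbitrary size for the example

    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    if (count < 2) {
        std::printf("This sketch needs at least two CUDA devices.\n");
        return 0;
    }

    void *src = nullptr, *dst = nullptr, *staging = nullptr;

    // Source buffer on GPU 0, filled with a known byte pattern.
    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&src, bytes));
    CHECK(cudaMemset(src, 0x5A, bytes));

    // Destination buffer on GPU 1.
    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc(&dst, bytes));

    // Pinned host buffer used as the staging area; the portable flag
    // asks for it to be treated as pinned by every CUDA context.
    CHECK(cudaHostAlloc(&staging, bytes, cudaHostAllocPortable));

    // Hop 1: GPU 0 -> host (one write to host memory).
    CHECK(cudaSetDevice(0));
    CHECK(cudaMemcpy(staging, src, bytes, cudaMemcpyDeviceToHost));

    // Hop 2: host -> GPU 1 (one read from host memory).
    CHECK(cudaSetDevice(1));
    CHECK(cudaMemcpy(dst, staging, bytes, cudaMemcpyHostToDevice));

    // Clean up, selecting the owning device before each free.
    CHECK(cudaFree(dst));
    CHECK(cudaSetDevice(0));
    CHECK(cudaFree(src));
    CHECK(cudaFreeHost(staging));

    std::printf("Staged copy of %zu bytes finished.\n", bytes);
    return 0;
}
```

The two `cudaMemcpy` calls are the part I would like to replace with a single direct transfer.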

There is an old post somewhere in which an NVIDIA rep stated that fast GPU-to-GPU copies were on the to-do list. I don’t remember where the post is.

My hope is that it shows up in the beta version of CUDA they are about to release, but I don’t know if that is the case. I haven’t even seen any rumors about what features might be in the update.