The only search result related to this topic that I could find was this post, which wasn’t very definitive. IIRC, Mark said the ‘ball & bricks’ physics demo from GDC a couple of years ago was doing direct DMA from one GPU to another, but that was in either OGL or DX.
Is there any way to do this using CUDA, or do I have to stage the copy through host memory (device to host, then host to device) and burn 2x host memory bandwidth for it?