Speeding up copies with OpenGL VR SLI

Hi, I’m coding for dual 1080s in an SLI config, rendering 4 views in two render passes, then copying 2 render textures back onto GPU0 for presentation. I’m rendering to large render targets (each is approx 3k x 2k pixels, RGBA16F format), and the glMulticastCopyImageSubDataNV call is killing performance.

I can reproduce the problem by hacking up the original vr_ogl_sli sample to render to 4 similarly sized render targets and copy 2 of them back. As the render target size (and bit depth) increases, copy performance goes downhill fast, and it doesn’t take much to end up with a slower frame rate than single-card rendering.

Are there any guidelines for how to optimise rendering/copying operations to get the best out of this? What sort of bandwidth is reasonable to expect across the SLI bus? That would give me an idea of what’s feasible.
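For reference, here is a rough back-of-envelope sketch of how much data the copy has to move. It assumes each target is exactly 3000 x 2000 pixels (the real size is only approximately that), 8 bytes per pixel for RGBA16F (4 channels x 16 bits), and a 90 Hz target frame rate, which is my assumption and not a measured figure. The function names are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>

// Size in bytes of one render target with the given dimensions and pixel stride.
constexpr std::uint64_t targetBytes(std::uint64_t width,
                                    std::uint64_t height,
                                    std::uint64_t bytesPerPixel) {
    return width * height * bytesPerPixel;
}

// Sustained transfer rate (bytes per second) needed to copy
// bytesPerFrame once per frame at the given frame rate.
constexpr double copyBytesPerSecond(std::uint64_t bytesPerFrame,
                                    double framesPerSecond) {
    return static_cast<double>(bytesPerFrame) * framesPerSecond;
}

// Prints the estimate for the assumed setup:
// 2 targets copied per frame, 3000 x 2000, RGBA16F, 90 Hz (all assumptions).
inline void printCopyEstimate() {
    constexpr std::uint64_t kBytesPerPixel = 8;   // RGBA16F: 4 channels x 2 bytes
    constexpr std::uint64_t kTargetsCopied = 2;   // 2 of the 4 views go back to GPU0
    constexpr double kAssumedFps = 90.0;          // assumed refresh rate

    constexpr std::uint64_t perTarget = targetBytes(3000, 2000, kBytesPerPixel);
    constexpr std::uint64_t perFrame = perTarget * kTargetsCopied;
    constexpr double perSecond = copyBytesPerSecond(perFrame, kAssumedFps);

    std::printf("per target: %.1f MB\n", perTarget / 1e6);
    std::printf("per frame:  %.1f MB\n", perFrame / 1e6);
    std::printf("per second: %.2f GB/s\n", perSecond / 1e9);
}
```

Under those assumptions that works out to 48 MB per target, 96 MB per frame, and about 8.64 GB/s of sustained copy traffic before any overhead, so knowing what the link can realistically deliver would tell me whether this approach is viable at all.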

TIA, any help appreciated!