How to best transfer memory between GPUs sitting on different PCIe controllers

I’m building a piece of software that needs to run on a dual-CPU system with multiple GPUs (a lot of them), and I’m looking for the best way to transfer data between those GPUs.

If all GPUs sit under the same PCIe root complex, I should be able to use GPUDirect (peer-to-peer access) to copy data between GPUs without going through the host. What happens, though, if I want to copy data between GPUs sitting under different PCIe root complexes (which are in fact attached to different CPUs in the same system)?
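For reference, this is the kind of peer copy I have in mind. My understanding (please correct me if this is wrong) is that `cudaMemcpyPeer` is legal even when peer access can't be enabled, and the runtime then stages the copy through host memory. The device IDs and buffer size here are just placeholders:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    // Ask the runtime whether device 0 can directly map device 1's memory.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
    }

    float *src, *dst;
    size_t bytes = 1 << 20;
    cudaSetDevice(1); cudaMalloc(&src, bytes);
    cudaSetDevice(0); cudaMalloc(&dst, bytes);

    // If P2P is enabled this should go directly over PCIe; otherwise
    // (as I understand it) the runtime falls back to a staged copy
    // through host memory rather than failing.
    cudaMemcpyPeer(dst, 0, src, 1, bytes);

    printf("peer access possible: %d\n", canAccess);
    return 0;
}
```

So the question is really whether that fallback path is what I get across root complexes, and how much it costs.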

I expect direct peer access to fail in that case, but does the copy then go through the host automatically, or does it fail completely? Also, I know there is growing support in MPI for using GPUs. Is it possible to run ranks on different CPUs (with CPU affinity) using MPI and transfer data between GPUs that way, effectively without going through the host? If so, which MPI implementation has the best support at the moment? (I have experience with both MPICH2 and Open MPI.)
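For the MPI route, what I'm picturing is a CUDA-aware MPI where I hand device pointers straight to `MPI_Send`/`MPI_Recv` and let the library choose the transport. This is just a sketch of the idea; it assumes an MPI build with CUDA support (e.g. Open MPI configured with `--with-cuda`), and the one-GPU-per-rank mapping is a naive placeholder:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank);  // placeholder: rank i drives GPU i

    const int n = 1 << 20;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    // With a CUDA-aware MPI, device pointers can (as I understand it)
    // be passed directly; the library handles the GPU-to-GPU transfer.
    if (rank == 0)
        MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```

What I don't know is whether, across two root complexes, such a library can avoid the host bounce or just hides it from me.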