I also got a reply from Nvidia which may be of interest to some. Apparently, overlapping work from different MPI processes on the same GPU will be supported in CUDA 5.5 this summer.
Quote from Ujval Kapasi:
You can do that now even on older HW, actually. Basically, you can run a different process on each core in your node, corresponding to different MPI ranks in your application. Each process can issue work (PCI transfers and computation) to the same GPU.
However, the older hardware and software will not overlap execution of items issued by different processes. These will be handled in serial.
However, HyperQ on K20 is better because it allows the hardware to overlap execution of items from different processes on the same node, when possible. In order to access that functionality on K20, you will need CUDA 5.5, which has not been released yet.
When CUDA 5.5 is released this summer, it will contain support for this. You will need to run a special server process to enable the functionality, and hence you will need system administrator privileges on your node.
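For anyone who wants to try the setup described in the quote, here is a minimal sketch of how I understand it: one MPI rank per core, all ranks issuing PCI transfers and a kernel to the same GPU. The kernel name `scale`, the buffer size, and the choice of device 0 are my own assumptions for illustration and are not from Nvidia's reply. Nothing in the code itself requests overlap; whether the work from different ranks overlaps or runs in serial depends on the hardware and CUDA version, as explained above.

```cuda
// Sketch only: several MPI ranks on one node share a single GPU.
// Build with nvcc plus the compile/link flags of your MPI installation.
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical kernel: multiplies every element by a factor.
__global__ void scale(float *x, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Assumption: every rank on the node targets the same device (0).
    cudaSetDevice(0);

    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    float *dev = NULL;
    cudaMalloc(&dev, n * sizeof(float));

    // Each process issues its own transfers and kernel to the shared GPU.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n, float(rank + 1));
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Report any error from the launch or the copies.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "rank %d: %s\n", rank, cudaGetErrorString(err));
    else
        printf("rank %d: x[0] = %f\n", rank, host[0]);

    cudaFree(dev);
    MPI_Finalize();
    return 0;
}
```

Launch it with as many ranks as you have cores on the node, using your MPI's usual launcher. Note that this will fail if the GPU is set to an exclusive compute mode, since only one process can then hold a context on it.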