The bus between the GPU's GDDR5 and the host's DDR3 is a severe limitation. I am trying to make my program distributed, and the slow bus bandwidth takes away all the advantages that distributed programming can give you. I am not saying I am abandoning it: I will try it on a server with GPUDirect RDMA, but I am certain that adding another K20 will give me a loss rather than a gain.
So the future trend is toward fast unified memory shared by the GPU and the CPU, such as GDDR5. This has already been introduced in the new PlayStation 4, for instance: the CPU can see the GPU's memory and the other way round. Right now we are limited both by the PCI Express bus and by the network that MPI runs over.
We are losing a great deal of efficiency with the current architecture. Far too much.
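To put a rough number on that gap, here is a back-of-envelope sketch in Python. The two bandwidth figures are assumptions: nominal peak values for a Tesla K20 (about 208 GB/s on-board) and a PCIe 2.0 x16 link (about 8 GB/s in one direction), not measurements from the setup described above. The helper names `transfer_seconds` and `bus_penalty` are made up for this illustration, and sustained throughput in practice is usually lower than either peak.

```python
# Nominal peak bandwidths in GB/s. These are assumed datasheet-style
# figures, not measurements; substitute the numbers for your own hardware.
GDDR5_K20_GBPS = 208.0   # Tesla K20 on-board memory (assumed peak)
PCIE2_X16_GBPS = 8.0     # PCIe 2.0 x16, one direction (assumed peak)


def transfer_seconds(n_bytes, gb_per_s):
    """Ideal time to move n_bytes at gb_per_s (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_s * 1e9)


def bus_penalty(on_device_gbps=GDDR5_K20_GBPS, bus_gbps=PCIE2_X16_GBPS):
    """How many times slower the bus is than on-device memory."""
    return on_device_gbps / bus_gbps


if __name__ == "__main__":
    payload = 10**9  # 1 GB of data exchanged between host and device
    t_device = transfer_seconds(payload, GDDR5_K20_GBPS)
    t_bus = transfer_seconds(payload, PCIE2_X16_GBPS)
    print(f"on-device: {t_device * 1e3:.2f} ms, over PCIe: {t_bus * 1e3:.2f} ms")
    print(f"bus is roughly {bus_penalty():.0f}x slower than on-device memory")
```

With these assumed peaks, every byte that has to cross the bus costs on the order of 26 times more than a byte read from the GPU's own memory, before any network hop for MPI is counted.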