Any portable approaches to using GPUs on a distributed system?

I’ve already set up a small Linux cluster with shared storage (via NFS, autofs, and LDAP), and I’m wondering if there is an easy way to launch simulations using all the GPUs across the different nodes.

I know MPI is one option, but the code seems a little more verbose to maintain. I know there are also special cluster OSes, such as ScaleMP/vSMP/vNUMA, that present a distributed system as a single SMP system, but I am wondering whether CUDA has any portable interface for remote execution on a distributed system?

I have limited experience, but I have always found MPI straightforward to use, and the model of parallelism it is based on matches CUDA’s own approach to parallelism quite nicely, IMHO. MPI does not strike me as excessively verbose, and what verbosity there is buys the programmer explicit control over communication patterns, which is usually a good thing from a performance angle.
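
To illustrate, here is a minimal sketch of the usual one-MPI-rank-per-GPU pattern: each rank binds to a GPU on its node, runs a kernel on its own data, and the partial results are combined across nodes with MPI_Allreduce. The OMPI_COMM_WORLD_LOCAL_RANK environment variable is an Open MPI-specific assumption; other MPI implementations expose the node-local rank differently.

```
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Scale every element of x by a.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Map this rank onto one of the GPUs on its node.
    // OMPI_COMM_WORLD_LOCAL_RANK is Open MPI-specific (an assumption here);
    // falling back to the global rank is just a crude default.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 1) MPI_Abort(MPI_COMM_WORLD, 1);
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local = lr ? atoi(lr) : rank;
    cudaSetDevice(local % ndev);

    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Each rank does its share of the work on its own GPU.
    scale<<<(n + 255) / 256, 256>>>(d, n, (float)(rank + 1));
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Combine per-rank partial results across all nodes.
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < n; ++i) local_sum += h[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %f\n", global_sum);

    cudaFree(d);
    free(h);
    MPI_Finalize();
    return 0;
}
```

The explicit MPI_Allreduce is exactly the kind of control I mean: you decide what moves over the network and when, rather than relying on a cluster OS to guess for you.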

From a software maintenance perspective, one advantage of MPI is that it is widely used (independent of any GPU acceleration) and that many people are familiar with it.
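
And launching across nodes is about as easy as it gets once MPI is set up. Something like the following (Open MPI-style flags; the hostfile name, hostnames, and MPI_HOME path are placeholders for whatever your cluster uses) starts one rank per GPU on two 4-GPU nodes:

```
# hosts file: one line per node, slots = GPUs per node
#   node01 slots=4
#   node02 slots=4
nvcc -o mpi_cuda mpi_cuda.cu -I$MPI_HOME/include -L$MPI_HOME/lib -lmpi
mpirun -np 8 --hostfile hosts ./mpi_cuda
```

Since your nodes already share storage over NFS, the binary and input data are visible everywhere, so there is nothing extra to stage.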