Developing CUDA applications for a cluster


I have an application that uses CUDA to process some data on a single card in 5 hours. Performance is limited by computation latency, not memory latency. We’d like to purchase a machine with 8 Tesla cards to speed up this computation further. I can modify my code to break the data up into 8 chunks and use cudaSetDevice() etc. to process one chunk per card. However, it would be even better to buy 2 networked machines with 16 cards in total. Furthermore, since I won’t be using the hardware 24/7, a lot of computing capacity would go to waste, so maybe somebody else would like a shot at the GPUs.
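For reference, here is a minimal sketch of the single-machine split I have in mind: one chunk per card, selected with cudaSetDevice(). The kernel `process`, the helper `chunk_range`, the `CUDA_CHECK` macro and the array size are placeholders made up for illustration, not my real code, and it assumes the data splits into independent chunks.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Placeholder kernel (hypothetical): stands in for the real computation.
__global__ void process(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Half-open range [begin, end) of chunk `part` when n items are split
// into `parts` chunks whose sizes differ by at most one.
struct Range { size_t begin, end; };

Range chunk_range(size_t n, int parts, int part) {
    size_t base = n / parts, extra = n % parts, p = part;
    size_t begin = p * base + std::min(p, extra);
    size_t len = base + (p < extra ? 1 : 0);
    return {begin, begin + len};
}

int main() {
    int device_count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&device_count));
    if (device_count == 0) {
        std::fprintf(stderr, "no CUDA devices found\n");
        return EXIT_FAILURE;
    }

    const size_t n = 1 << 24;              // assumed problem size
    std::vector<float> host(n, 1.0f);
    std::vector<float*> dev_ptr(device_count, nullptr);

    // Phase 1: copy one chunk to each card and launch the kernel there.
    // Kernel launches return immediately, so the cards compute concurrently.
    for (int d = 0; d < device_count; ++d) {
        Range r = chunk_range(n, device_count, d);
        size_t len = r.end - r.begin;
        if (len == 0) continue;
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaMalloc(&dev_ptr[d], len * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(dev_ptr[d], host.data() + r.begin,
                              len * sizeof(float), cudaMemcpyHostToDevice));
        const int threads = 256;
        const unsigned blocks = (unsigned)((len + threads - 1) / threads);
        process<<<blocks, threads>>>(dev_ptr[d], len);
        CUDA_CHECK(cudaGetLastError());
    }

    // Phase 2: collect results. The blocking cudaMemcpy waits for the
    // kernel previously launched on that card to finish.
    for (int d = 0; d < device_count; ++d) {
        Range r = chunk_range(n, device_count, d);
        size_t len = r.end - r.begin;
        if (len == 0) continue;
        CUDA_CHECK(cudaSetDevice(d));
        CUDA_CHECK(cudaMemcpy(host.data() + r.begin, dev_ptr[d],
                              len * sizeof(float), cudaMemcpyDeviceToHost));
        CUDA_CHECK(cudaFree(dev_ptr[d]));
    }

    std::printf("host[0] = %f, host[n-1] = %f\n", host[0], host[n - 1]);
    return 0;
}
```

This should build with something like `nvcc multi_gpu.cu -o multi_gpu` (the file name is arbitrary). The device loop only sees the cards inside one machine, which is exactly why I am asking about the second machine below.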

So I’m basically talking about HPC clusters, but I find the available information very confusing, and most of the solutions seem to be proprietary (the Torque job scheduler seems to be the only open-source one).

How do I need to modify my existing CUDA code to run it as a job on a “typical” cluster?
Do I have to compile the CUDA code as an executable and make it available on each node in the cluster? Or are there fancy solutions that present a “virtual” CUDA device on the host machine, while the memory allocations, H2D copies and kernel launches actually get executed on some device in the “cloud”? If such a solution exists, is there any ballpark figure on costs?