I used to have access to a cluster of two machines with 4 Tesla C2050 cards each. I have a code that runs on this hardware using MPI and CUDA (of course). Each Tesla card is managed by one MPI process.
Last week the cluster was updated somehow; now I have just one machine with 7 C2070 cards and CUDA 4.0 instead of CUDA 3.1.
The thing is that my code is now much slower than before. In fact, running it on several cards is much slower than running it on just one card (on this new configuration).
With just one card it has a normal running time.
Do you think the configuration change may have something to do with this change in speed? Is there any known issue with CUDA 4.0 and MPI?
Thanks for your help!