Never mind CUDA 4.0 RC2 and its ability to control all of the system’s devices from a single host thread; I’m on CUDA 3.2. I’m building a multi-GPU application to run on 8 nodes, each fitted with two Tesla M2050 cards.
I’m using MPI since I need to spawn processes across all nodes; the processes run asynchronously with respect to processes on other nodes, but each one must run synchronously with respect to the other process on its own node.
MPI defines MPI_Win to achieve mutual exclusion on RMA accesses, but that’s not what I’m after. I’d like something like a pthread_mutex_t that gives me mutual exclusion over the whole sequence of copying data to the device, launching a kernel on it, and copying the results back to the host. I suppose I could make do with MPI_Win_lock and MPI_Win_unlock, but those lock a memory region, so there has to be a mutex in there somewhere, which makes the MPI_Win functions just a roundabout way to lock and unlock a mutex plus a bunch of other work that wastes a precious amount of time.
How can I get mutual exclusion involving my cudaMemcpy calls and kernel launches in MPI?
Thank you for your time.