Mutual exclusion with MPI Windows

Never mind CUDA 4.0 RC2 and its ability to control all the system’s devices from a single host thread; I’m using CUDA 3.2. I’m creating a multi-GPU application to run on 8 nodes, each fitted with 2 Tesla M2050 cards.

I’m using MPI since I’ll need to spawn processes across all nodes; these run asynchronously with respect to processes on other nodes, but need to run synchronously with respect to the other process on their own node.

MPI defines MPI_Win to provide mutual-exclusion locks on RMA accesses. But that’s not what I’m after; I’d like something like a pthread_mutex_t to get mutual exclusion while I’m copying data to the device, issuing a kernel on the device and finally getting the data back to the host. I suppose I could make do with the MPI_Win_lock and MPI_Win_unlock functions, but those lock a memory region, so there has to be a mutex in there somewhere, which in turn makes the MPI_Win functions just a roundabout way of locking and unlocking a mutex plus a bunch of other machinery that wastes precious time.
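
For reference, this is roughly the pattern I have in mind (just a sketch with placeholder names; also, implementations are allowed to delay actually taking the lock until RMA operations are issued, so it may not even be a dependable general-purpose mutex):

```c
/* Sketch of the MPI_Win_lock/MPI_Win_unlock idea (placeholder names;
 * an MPI implementation may defer acquiring the lock until RMA calls
 * are issued, so this is not a guaranteed mutex). */
#include <mpi.h>
#include <cuda_runtime.h>

void exclusive_gpu_work(MPI_Win win, int home_rank,
                        float *h_data, float *d_data, size_t bytes)
{
    /* Exclusive passive-target epoch on home_rank's window: other ranks
       requesting an exclusive lock on the same window are held off. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, home_rank, 0, win);

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    /* kernel launch would go here */
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    MPI_Win_unlock(home_rank, win);
}
```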

How can I get mutual exclusion involving my cudaMemcpy calls and kernel launches in MPI?

Thank you for your time.

You can use MPI_COMM_SPLIT to form a “colour” (sub-communicator) for each node, then use the usual MPI synchronization primitive at the scope of each colour, making processes (and hence GPUs) on the same host synchronous while each colour stays asynchronous with respect to all the other colours. Commands issued at the scope of the original communicator synchronize all nodes as you would expect.
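
A minimal sketch of that split, assuming the colour is derived from the host name (the hashing and variable names are only illustrative; a hash collision between different host names would wrongly merge two nodes):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes on the same physical node derive the same colour from the
       host name, so they end up in the same sub-communicator. */
    char hostname[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(hostname, &len);

    unsigned colour = 0;
    for (int i = 0; i < len; ++i)
        colour = colour * 31u + (unsigned)hostname[i];

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, (int)(colour & 0x7fffffffu), world_rank, &node_comm);

    /* Synchronizes only the processes sharing this host (and its GPUs);
       other nodes are unaffected. */
    MPI_Barrier(node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```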

Another alternative is to go hybrid: run a single MPI process per node and use threading within the node. That way you can have the explicit mutex you are looking for without involving MPI at all.
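
A rough sketch of that layout, serializing the copy/launch/copy sequence per GPU behind an explicit mutex (the kernel, struct and function names are placeholders):

```cuda
#include <cuda_runtime.h>
#include <pthread.h>

/* One mutex shared by all worker threads on this node, protecting the
   host<->device traffic on the shared PCIe bus. */
static pthread_mutex_t pcie_mutex = PTHREAD_MUTEX_INITIALIZER;

__global__ void my_kernel(float *d, int n)   /* placeholder kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

struct task { int device; float *h_data; int n; };

static void *worker(void *arg)
{
    struct task *t = (struct task *)arg;
    cudaSetDevice(t->device);            /* one context per thread, pre-CUDA-4.0 style */

    float *d_data;
    cudaMalloc(&d_data, t->n * sizeof(float));

    pthread_mutex_lock(&pcie_mutex);     /* copy in + launch + copy out, exclusively */
    cudaMemcpy(d_data, t->h_data, t->n * sizeof(float), cudaMemcpyHostToDevice);
    my_kernel<<<(t->n + 255) / 256, 256>>>(d_data, t->n);
    cudaMemcpy(t->h_data, d_data, t->n * sizeof(float), cudaMemcpyDeviceToHost);
    pthread_mutex_unlock(&pcie_mutex);

    cudaFree(d_data);
    return NULL;
}
```

Each worker would be started with pthread_create, one per device, after the node’s single MPI process has received or generated its data.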

The thing is, POSIX threads programming requires additional effort that MPI does not. In fact, I started my application with a single-GPU approach, then jumped to pthreads as you said, and then wondered why not use MPI from the beginning, since I would eventually integrate all the nodes into the application.

Tell me, what is the usual synchronization primitive in MPI? 95% of MPI is still new to me…

That was actually a typo; it should have been primitives, because there is more than one. MPI_BARRIER is the canonical blocking sync, but any of the blocking broadcast mechanisms or point-to-point copies can also be used to provide synchronization. If you use the local communicator, then you get synchronization between the MPI processes holding GPUs on the same physical compute node.
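
For instance (purely illustrative; take_turns and do_gpu_work are made-up names), blocking point-to-point messages on the node-local communicator can be turned into a turn-taking scheme, so the ranks sharing a node use the bus one after the other:

```c
#include <mpi.h>

void do_gpu_work(void);   /* stands in for cudaMemcpy + kernel + cudaMemcpy */

void take_turns(MPI_Comm node_comm)
{
    int rank, size;
    MPI_Comm_rank(node_comm, &rank);
    MPI_Comm_size(node_comm, &size);

    int token = 0;
    if (rank > 0)          /* wait until the previous rank on this node is done */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, node_comm, MPI_STATUS_IGNORE);

    do_gpu_work();         /* this rank has the bus to itself */

    if (rank < size - 1)   /* pass the turn to the next rank on this node */
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, node_comm);

    MPI_Barrier(node_comm);   /* everyone resumes together */
}
```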

It isn’t immediately obvious why you think you need a mutex in this case anyway. If each MPI process holds a unique GPU context on its own device (i.e. the standard pre-CUDA-4.0 multi-GPU model), then the only synchronization points needed are one-sided or point-to-point data exchanges between processes, aren’t they? An individual MPI process doesn’t need a mutex to work on its own GPU, does it? Or am I missing something?

I don’t think you’re missing anything; at least ideally you’re right. Furthermore, in my specific case processes don’t exchange data among themselves: each one is loaded with a unique set of data generated within that process. My doubt comes from previous experiments with multi-GPU.

I never tried multiple GPU contexts per node on CUDA 3.x; my previous experiments with pthreads and CUDA were on version 2.3 (although the standard model for multiple GPUs still holds, I think), and I had conflicts with PCIe bus sharing, hence the mutual exclusion. I realized that user-enforced synchronization gave better performance than the OS’s default scheduling. That’s why I’m not very thrilled about this multi-GPU thing, although my results and assumptions about multi-GPU could be (and in fact are) biased by an nForce 200 motherboard with only one PCIe bus but 4 PCIe slots. Add kernels that take about 10 to 20 ms and an overhead of 60 ms for each thread to take control of the bus, and you get a speedup of less than 1 when it should be at least 1 as the GPU count increases.

However, it seems that my application, running on multiple nodes and on both cards per node, is not suffering any penalty at all on a NEC workstation. That might be due to a dedicated PCIe bus per slot, or to efficient OS scheduling. In the former case an MPI_Barrier before the cudaMemcpy + kernel + cudaMemcpy sequence won’t have any impact on performance; in the latter case, let’s see how badly performance gets punished.

[I sometimes wonder why a lab buys 3 Tesla C1060s and fits them onto a motherboard with only one PCIe bus… It would have been better to build a Beowulf cluster with 3 nodes, InfiniBand or not.]

I must admit our lab went with the latter option - we have many single-GPU “Beowulf” style nodes. It makes more sense to me to have more PCIe buses and host memory channels, even if the fabric between nodes is slow, than the other way around. Having said that, we also have a few dual- and triple-GPU workstations which are used for development and for workloads where multi-GPU with threads works better than MPI. Horses for courses, as they say.

[Thoughtful lab policy of yours, I must say]

I also enjoy the Seymour Cray phrase:

“What would you rather have to plow a field — two strong oxen or 1,024 chickens?”

But sometimes it seems that an extra 1,024 chickens won’t interact very well with the other 1,024 chickens… at least on nForce 200 boards!

You were right, avviday! No sync is needed apart from the MPI_Barrier to time all nodes coherently. I get a 15.7× speedup when increasing the GPU count to 16 running under MPI. This is very, very good, but it isn’t actually an astonishing achievement; it just behaved as it should. I was biased by my poor old nForce 200 motherboard.
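
The timing itself is just the usual barrier-bracketed pattern, roughly like this (gpu_phase is a placeholder for the cudaMemcpy + kernel + cudaMemcpy sequence):

```c
#include <mpi.h>

/* Every rank measures the same phase: the first barrier lines everyone up,
   the second waits for the slowest node before stopping the clock. */
double time_gpu_phase(void (*gpu_phase)(void))
{
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    gpu_phase();

    MPI_Barrier(MPI_COMM_WORLD);
    return MPI_Wtime() - t0;
}
```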

Since I only time the execution and the data exchanges, I don’t see a reason to port this code to pthreads, especially since I don’t need any mutex handling. Do you?

Thank you, you’ve been most helpful. Great success!