cudaMalloc with CUDA and MPI Groups giving me trouble

I’m having some trouble allocating memory on CUDA capable devices when using MPI groups.

I can allocate memory on the hosts without any problem, but as soon as I call cudaMalloc, some devices (not all) return the error
“all CUDA-capable devices are busy or unavailable.”

There is nothing wrong with the device memory allocation code, because I am adding MPI groups to code that has already run successfully on 32 T10s. I hope to speed up data transfer between devices by replacing an MPI_Allgather across all 32 processes with an MPI_Gather within each S1070 and an MPI_Allgather between the 8 S1070s. I thought groups would be a simple way of partitioning the processes on nodes for communication.

If MPI groups do not work with CUDA, is there a way around this? I imagine the problem I am trying to solve is fairly common.

You probably want to use coloring with a split communicator rather than groups for something like this. MPI_Comm_split can be used to create sub-communicators, where all processes with the same color end up in the same sub-communicator. If you create one color per physical host and then do context establishment using the rank within that color to select devices, you should get the correct assignments. My standard code for this is written in Python, which probably won’t be much good to you, but Massimiliano Fatica posted a useful prototype for this approach in this thread.
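Something along these lines (a quick sketch in C, not Fatica's actual code; the hostname hash and the one-process-per-device assumption are mine) shows the idea:

/* Sketch: split MPI_COMM_WORLD into one communicator per physical host,
 * then use the local rank within that host to pick a GPU.
 * Assumes one S1070 (4 devices) per host and <= 4 MPI processes per host. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, local_rank, dev, namelen;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &namelen);

    /* Derive a color from the hostname so all processes on the same host
     * share a color.  A simple hash is used here; a real implementation
     * should guard against collisions between different hostnames. */
    unsigned int color = 0;
    for (int i = 0; i < namelen; ++i)
        color = color * 31 + (unsigned char)hostname[i];
    color &= 0x7fffffff;              /* color must be non-negative */

    MPI_Comm_split(MPI_COMM_WORLD, (int)color, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* One process per device: local ranks 0..3 map to devices 0..3. */
    cudaSetDevice(local_rank);
    cudaGetDevice(&dev);
    printf("world rank %d on %s -> local rank %d -> device %d\n",
           world_rank, hostname, local_rank, dev);

    float *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, 1 << 20) != cudaSuccess)
        fprintf(stderr, "rank %d: cudaMalloc failed\n", world_rank);
    else
        cudaFree(d_buf);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Launch one MPI process per GPU (so 4 per host for an S1070) and each process should end up on its own device, which avoids the “busy or unavailable” error when the devices are in exclusive compute mode.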

With the split communicator, you can do operations both within the per-host communicator, which in this case would be local to a single S1070, and then at the internode level, between S1070s. Might be what you are looking for.
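To make the two-level pattern concrete, here is a rough sketch reusing node_comm from the snippet above: gather onto a per-node leader, allgather between the leaders, and (optionally) broadcast the result back within each node. The chunk size, buffer names, and the assumption that every node runs the same number of processes are mine for illustration, not from your code:

/* Two-level collective using the split communicators.  node_comm groups
 * the processes on one S1070; a second split keeps only the local rank-0
 * "leader" of each node. */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 1024   /* elements contributed by each process (assumed) */

void two_level_allgather(const float *my_chunk, float *full_buf,
                         int world_size, MPI_Comm node_comm)
{
    int world_rank, local_rank, local_size;
    MPI_Comm leader_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);

    /* Step 1: gather all chunks from this node onto the local leader. */
    float *node_buf = NULL;
    if (local_rank == 0)
        node_buf = malloc(sizeof(float) * CHUNK * local_size);
    MPI_Gather(my_chunk, CHUNK, MPI_FLOAT,
               node_buf, CHUNK, MPI_FLOAT, 0, node_comm);

    /* Step 2: allgather between the node leaders only. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   local_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);
    if (local_rank == 0)
        MPI_Allgather(node_buf, CHUNK * local_size, MPI_FLOAT,
                      full_buf, CHUNK * local_size, MPI_FLOAT, leader_comm);

    /* Step 3 (optional): push the assembled buffer back to the other
     * processes on the node if they all need it, as an Allgather would. */
    MPI_Bcast(full_buf, CHUNK * world_size, MPI_FLOAT, 0, node_comm);

    if (local_rank == 0) {
        MPI_Comm_free(&leader_comm);
        free(node_buf);
    }
}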

EDIT: My failing memory tells me we might have had this very conversation a few times before…

Yes, I’ll try MPI_Comm_split. Sounds like it would solve the problem.

I have asked about MPI with CUDA before, but not this particular problem.

The mfatica code showed that the device assignment was not what I thought: more than one process was being assigned to some devices, hence the “busy or unavailable” error. Once that was corrected, MPI groups work with cudaMalloc. No problems.

I was remembering this thread, which turns out to be pretty much the exact solution you needed in this case. But it’s good you worked it out.