cudaMalloc with CUDA and MPI Groups giving me trouble

I’m having some trouble allocating memory on CUDA capable devices when using MPI groups.

I can allocate memory on the hosts without any problem, but as soon as I call cudaMalloc, some devices (not all) return the error
“all CUDA-capable devices are busy or unavailable.”

There is nothing wrong with the device memory allocation code, because I am adding MPI groups to code that has already run successfully on 32 T10s. I hope to speed up data transfer between devices by replacing an MPI_Allgather across all 32 processes with an MPI_Gather within each S1070 and an MPI_Allgather between the 8 S1070s. I thought groups would be a simple way of partitioning the processes on nodes for communication.

If MPI groups do not work with CUDA, is there a way around this? I imagine the problem I am trying to solve is fairly common.

You probably want to use coloring with a split communicator rather than groups for something like this. MPI_Comm_split can be used to create sub-communicators, where all processes with the same color end up in the same sub-communicator. If you create one color per physical host and then do context establishment using the rank within that color to select devices, you should get the correct assignments. My standard code for this is written in Python, which probably won’t be much good to you, but Massimiliano Fatica posted a useful prototype for this approach in this thread.
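Something along these lines (a quick sketch in C, not Fatica's actual code; the hostname hash and the one-process-per-device assumption are mine) shows the idea:

/* Sketch: split MPI_COMM_WORLD into one communicator per physical host,
 * then use the local rank within that host to pick a GPU.
 * Assumes one S1070 (4 devices) per host and <= 4 MPI processes per host. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int world_rank, local_rank, dev, namelen;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm node_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &namelen);

    /* Derive a color from the hostname so all processes on the same host
     * share a color.  A simple hash is used here; a real implementation
     * should guard against collisions between different hostnames. */
    unsigned int color = 0;
    for (int i = 0; i < namelen; ++i)
        color = color * 31 + (unsigned char)hostname[i];
    color &= 0x7fffffff;              /* color must be non-negative */

    MPI_Comm_split(MPI_COMM_WORLD, (int)color, world_rank, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);

    /* One process per device: local ranks 0..3 map to devices 0..3. */
    cudaSetDevice(local_rank);
    cudaGetDevice(&dev);
    printf("world rank %d on %s -> local rank %d -> device %d\n",
           world_rank, hostname, local_rank, dev);

    float *d_buf = NULL;
    if (cudaMalloc((void **)&d_buf, 1 << 20) != cudaSuccess)
        fprintf(stderr, "rank %d: cudaMalloc failed\n", world_rank);
    else
        cudaFree(d_buf);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Launch one MPI process per GPU (so 4 per host for an S1070) and each process should end up on its own device, which avoids the “busy or unavailable” error when the devices are in exclusive compute mode.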

With the split communicator, you can do operations both within the per-host communicator, which in this case would be local to a single S1070, and then at the internode level, between S1070s. Might be what you are looking for.
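To make the two-level pattern concrete, here is a rough sketch reusing node_comm from the snippet above: gather onto a per-node leader, allgather between the leaders, and (optionally) broadcast the result back within each node. The chunk size, buffer names, and the assumption that every node runs the same number of processes are mine for illustration, not from your code:

/* Two-level collective using the split communicators.  node_comm groups
 * the processes on one S1070; a second split keeps only the local rank-0
 * "leader" of each node. */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 1024   /* elements contributed by each process (assumed) */

void two_level_allgather(const float *my_chunk, float *full_buf,
                         int world_size, MPI_Comm node_comm)
{
    int world_rank, local_rank, local_size;
    MPI_Comm leader_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_size(node_comm, &local_size);

    /* Step 1: gather all chunks from this node onto the local leader. */
    float *node_buf = NULL;
    if (local_rank == 0)
        node_buf = malloc(sizeof(float) * CHUNK * local_size);
    MPI_Gather(my_chunk, CHUNK, MPI_FLOAT,
               node_buf, CHUNK, MPI_FLOAT, 0, node_comm);

    /* Step 2: allgather between the node leaders only. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   local_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);
    if (local_rank == 0)
        MPI_Allgather(node_buf, CHUNK * local_size, MPI_FLOAT,
                      full_buf, CHUNK * local_size, MPI_FLOAT, leader_comm);

    /* Step 3 (optional): push the assembled buffer back to the other
     * processes on the node if they all need it, as an Allgather would. */
    MPI_Bcast(full_buf, CHUNK * world_size, MPI_FLOAT, 0, node_comm);

    if (local_rank == 0) {
        MPI_Comm_free(&leader_comm);
        free(node_buf);
    }
}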

EDIT: My failing memory tells me we might have had this very conversation a few times before…

Yes, I’ll try MPI_Comm_split. Sounds like it would solve the problem.

I have asked about MPI with CUDA before, but not this particular problem.

The mfatica code showed that the device assignment was not what I thought: more than one process was being assigned to some devices, hence the “busy or unavailable” error. Once that was corrected, MPI groups work with cudaMalloc. No problems.

I was remembering this thread, which turns out to be pretty much the exact solution you needed in this case. But it’s good you worked it out.