I’m having some trouble allocating memory on CUDA-capable devices when using MPI groups.
I can allocate memory on the hosts without any problem, but as soon as I call cudaMalloc, some devices (not all) return the error:
“all CUDA-capable devices are busy or unavailable.”
There is nothing wrong with the device memory allocation code itself, because I am adding MPI groups to code that has already run successfully on 32 T10s. I hope to speed up data transfer between devices by replacing an MPI_Allgather across all 32 processes with an MPI_Gather within each S1070 followed by an MPI_Allgather between the 8 S1070s. I thought groups would be a simple way of partitioning the processes on each node for communication.
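To make the partitioning concrete, here is a minimal sketch of the kind of setup I have in mind (not my actual code; the node/GPU counts and variable names are placeholders for 8 nodes with 4 T10 GPUs each). The device selection at the end is where I suspect the failure can arise: if two ranks end up on the same GPU and the driver is in exclusive compute mode, cudaMalloc reports “busy or unavailable.”

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    const int ranks_per_node = 4;             /* 4 GPUs per S1070 */
    int node  = world_rank / ranks_per_node;  /* which S1070 this rank is on */
    int local = world_rank % ranks_per_node;  /* rank within that node */

    /* Build one group (and communicator) per node out of MPI_COMM_WORLD,
       so the per-node MPI_Gather runs only among these 4 ranks. */
    MPI_Group world_group, node_group;
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    int ranks[4];
    for (int i = 0; i < ranks_per_node; ++i)
        ranks[i] = node * ranks_per_node + i;
    MPI_Group_incl(world_group, ranks_per_node, ranks, &node_group);
    MPI_Comm node_comm;
    MPI_Comm_create(MPI_COMM_WORLD, node_group, &node_comm);

    /* Bind each local rank to its own GPU *before* any cudaMalloc.
       If two ranks pick the same device under exclusive compute mode,
       the second cudaMalloc fails with "busy or unavailable". */
    cudaError_t err = cudaSetDevice(local);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaSetDevice: %s\n",
                world_rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    float *d_buf = NULL;
    err = cudaMalloc((void **)&d_buf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc: %s\n",
                world_rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... per-node MPI_Gather on node_comm, then MPI_Allgather
       among the node leaders ... */

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

(This needs to be launched with mpirun across the nodes, so I can’t show output here; the point is just the rank-to-GPU mapping and the per-node communicator.)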
If MPI groups do not work with CUDA, is there a way around this? I imagine the problem I am trying to solve is fairly common.