Error: Number of CUDA devices (2) is less than MPI processes (4)

Hi, I have the following program multi_node_multi_gpu.cpp:

```
#include <mpi.h>
#include <cuda_runtime.h>
#include <iostream>

#define CUDA_RT_CALL(call) \
{ \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        std::cerr << "CUDA Runtime error at " << __FILE__ << ":" << __LINE__ << ": " \
                  << cudaGetErrorString(err) << std::endl; \
        MPI_Abort(MPI_COMM_WORLD, 1); \
    } \
}

int main(int argc, char *argv[]) {
    int rank, size;
    int numDevices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaGetDeviceCount(&numDevices);

    // Ensure there are enough GPUs for all MPI processes
    if (numDevices < size) {
        if (rank == 0) {
            std::cerr << "Error: Number of CUDA devices (" << numDevices
                      << ") is less than MPI processes (" << size << ")" << std::endl;
        }
        MPI_Finalize();
        return 1;
    }

    // Determine which GPU to use based on MPI rank
    int device = rank % numDevices;
    CUDA_RT_CALL(cudaSetDevice(device));

    cudaDeviceProp prop;
    CUDA_RT_CALL(cudaGetDeviceProperties(&prop, device));

    // Gather information from all nodes to node 0
    char node_name[MPI_MAX_PROCESSOR_NAME];
    int node_name_len;
    MPI_Get_processor_name(node_name, &node_name_len);

    // Print CPU and GPU information from each MPI rank
    std::cout << "Node " << rank << " (on " << node_name << "), GPU " << device << ": " << prop.name << std::endl;
    std::cout << "  Compute capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "  Total global memory: " << prop.totalGlobalMem << " bytes" << std::endl;
    std::cout << "  Memory clock rate: " << prop.memoryClockRate << " kHz" << std::endl;
    std::cout << "  Memory bus width: " << prop.memoryBusWidth << " bits" << std::endl;
    std::cout << "  Peak memory bandwidth (GB/s): " << 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6 << std::endl;

    MPI_Finalize();
    return 0;
}
```

```
[vorlket@server cudaprac]$ nvcc -o cuda_code.o -c multi_node_multi_gpu.cpp -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52
[vorlket@server cudaprac]$ mpic++ -o multi_node_multi_gpu cuda_code.o -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/targets/x86_64-linux/lib -l:libcudart.so
[vorlket@server cudaprac]$ cp multi_node_multi_gpu ~/sharedfolder/multi_node_multi_gpu
[vorlket@server cudaprac]$ mpirun -host server:2,midiserver:2 -np 4 /home/vorlket/sharedfolder/multi_node_multi_gpu
```

Running mpirun gives me "Error: Number of CUDA devices (2) is less than MPI processes (4)". I have 2 GPUs on midiserver and another 2 GPUs on server, so 4 in total. I don't understand why I get this error. Could you please help me understand what's going on?

Thanks.

You may need to learn more about both CUDA and MPI in order for this to make any sense.

The printout is arising from here, of course:
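
```
if (numDevices < size) {
    if (rank == 0) {
        std::cerr << "Error: Number of CUDA devices (" << numDevices
                  << ") is less than MPI processes (" << size << ")" << std::endl;
    }
    MPI_Finalize();
    return 1;
}
```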

When you launch a program using mpirun, the mpirun "launcher" starts the number of processes you specify via -np and distributes them across the nodes you specify, if you have provided a distribution specification (which you have).

Your mpirun command has requested 4 total processes, distributed across 2 nodes, so 2 processes per node. The code I previously excerpted strikes me as a little unusual, though certainly not incorrect or illegal, in that it seems to require that each node have as many GPUs as there are MPI processes in total. This would be sensible if the code were written from the perspective that only one node would ever run the program. Specifically, MPI_COMM_WORLD is a "communicator", which represents a handle to a specific set of MPI ranks; the "WORLD" designator means all MPI ranks associated with your program. So
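
```
MPI_Comm_size(MPI_COMM_WORLD, &size);
```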

is returning (in size) the total number of ranks you have launched, which is 4.

So far none of my comments are unique or specific to CUDA; they represent general MPI knowledge. It's not my intention to spend a lot of time on this forum educating folks about how MPI works.

The next line:
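
```
cudaGetDeviceCount(&numDevices);
```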

returns the number of GPUs (in numDevices) on a specific node (whatever node that rank happens to be running on). In your situation this number is presumably 2, so, the way the code is written, it exits at that point. Hopefully it is clear that, as written, that snippet will only "not exit" if the number of GPUs on a given node is equal to or greater than the total number of ranks, not the number of ranks per node.

But ordinary/typical MPI usage would only require 1 GPU per rank on each node, so this is an unusual "check" in my view. A sketch of the more typical per-node check follows.
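
For reference, here is a minimal sketch of that per-node pattern, assuming an MPI-3 implementation (it uses MPI_Comm_split_type with MPI_COMM_TYPE_SHARED to build a per-node communicator); the local_comm, local_rank, and local_size names are mine, not anything from your code:

```
// Sketch: derive a per-node ("local") rank, then check and select GPUs per node
MPI_Comm local_comm;
int local_rank, local_size;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &local_comm);
MPI_Comm_rank(local_comm, &local_rank);  // this rank's index on its node
MPI_Comm_size(local_comm, &local_size);  // number of ranks on this node

int numDevices;
cudaGetDeviceCount(&numDevices);

// Compare GPUs on this node against ranks on this node, not ranks in total
if (numDevices < local_size) {
    if (local_rank == 0) {
        std::cerr << "Error: node has " << numDevices << " GPUs for "
                  << local_size << " local ranks" << std::endl;
    }
    MPI_Abort(MPI_COMM_WORLD, 1);
}

CUDA_RT_CALL(cudaSetDevice(local_rank % numDevices));
MPI_Comm_free(&local_comm);
```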

It’s further curious code because this line:
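
```
int device = rank % numDevices;
```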

is specifically designed to allow more ranks per node than there are GPUs per node (which is, again, legal and not necessarily incorrect).
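
To make that concrete, here is how the modulo mapping plays out, assuming (hypothetically) that ranks 0 and 1 land on server and ranks 2 and 3 on midiserver, with 2 GPUs visible on each node:

```
int device = rank % numDevices;  // numDevices == 2 on each node
// rank 0 (server)     -> device 0 on server
// rank 1 (server)     -> device 1 on server
// rank 2 (midiserver) -> device 0 on midiserver
// rank 3 (midiserver) -> device 1 on midiserver
// If all 4 ranks ran on a single node instead, ranks 0 and 2 would share
// device 0 and ranks 1 and 3 would share device 1 (oversubscribed, but legal).
```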

Since your code in its entirety does nothing useful (other than printing out some GPU spec data), it's hard to say what the proper intent is. I think I would probably just advise getting rid of the code I originally excerpted, i.e. this statement in its entirety, including its body:
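
```
if (numDevices < size) {
    if (rank == 0) {
        std::cerr << "Error: Number of CUDA devices (" << numDevices
                  << ") is less than MPI processes (" << size << ")" << std::endl;
    }
    MPI_Finalize();
    return 1;
}
```

If you still want a sanity check, the per-node version sketched above (using MPI_Comm_split_type) would be a more typical replacement; otherwise the rank % numDevices line already handles device selection on each node.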
