Error: Number of CUDA devices (2) is less than MPI processes (4)

Hi, I have the following program multi_node_multi_gpu.cpp:

```
#include <mpi.h>
#include <cuda_runtime.h>
#include <iostream>

#define CUDA_RT_CALL(call) \
{ \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        std::cerr << "CUDA Runtime error at " << __FILE__ << ":" << __LINE__ << ": " \
                  << cudaGetErrorString(err) << std::endl; \
        MPI_Abort(MPI_COMM_WORLD, 1); \
    } \
}

int main(int argc, char *argv[]) {
    int rank, size;
    int numDevices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaGetDeviceCount(&numDevices);

    // Ensure there are enough GPUs for all MPI processes
    if (numDevices < size) {
        if (rank == 0) {
            std::cerr << "Error: Number of CUDA devices (" << numDevices
                      << ") is less than MPI processes (" << size << ")" << std::endl;
        }
        MPI_Finalize();
        return 1;
    }

    // Determine which GPU to use based on MPI rank
    int device = rank % numDevices;
    CUDA_RT_CALL(cudaSetDevice(device));

    cudaDeviceProp prop;
    CUDA_RT_CALL(cudaGetDeviceProperties(&prop, device));

    // Gather information from all nodes to node 0
    char node_name[MPI_MAX_PROCESSOR_NAME];
    int node_name_len;
    MPI_Get_processor_name(node_name, &node_name_len);

    // Print CPU and GPU information from each MPI rank
    std::cout << "Node " << rank << " (on " << node_name << "), GPU " << device << ": " << prop.name << std::endl;
    std::cout << "  Compute capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "  Total global memory: " << prop.totalGlobalMem << " bytes" << std::endl;
    std::cout << "  Memory clock rate: " << prop.memoryClockRate << " kHz" << std::endl;
    std::cout << "  Memory bus width: " << prop.memoryBusWidth << " bits" << std::endl;
    std::cout << "  Peak memory bandwidth (GB/s): " << 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6 << std::endl;

    MPI_Finalize();
    return 0;
}
```

```
[vorlket@server cudaprac]$ nvcc -o cuda_code.o -c multi_node_multi_gpu.cpp -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52
[vorlket@server cudaprac]$ mpic++ -o multi_node_multi_gpu cuda_code.o -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/targets/x86_64-linux/lib -l:libcudart.so
[vorlket@server cudaprac]$ cp multi_node_multi_gpu ~/sharedfolder/multi_node_multi_gpu
[vorlket@server cudaprac]$ mpirun -host server:2,midiserver:2 -np 4 /home/vorlket/sharedfolder/multi_node_multi_gpu
```

Running mpirun gives me "Error: Number of CUDA devices (2) is less than MPI processes (4)". I have 2 GPUs on midiserver and another 2 GPUs on server, so 4 in total. I don't understand why I get this error. Could you please help me understand what's going on?

Thanks.

You may need to learn more about both CUDA and MPI in order for this to make any sense.

The printout is arising from here, of course:
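
```
if (numDevices < size) {
    if (rank == 0) {
        std::cerr << "Error: Number of CUDA devices (" << numDevices
                  << ") is less than MPI processes (" << size << ")" << std::endl;
    }
    MPI_Finalize();
    return 1;
}
```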

When you launch a program using mpirun, the mpirun "launcher" starts the number of processes you specify via -np and distributes them across the nodes you specify, if you have provided a distribution specification (which you have).

Your mpirun command has requested 4 total processes, distributed across 2 nodes, so 2 processes per node. The code I previously excerpted strikes me as a little unusual, though certainly not incorrect or illegal, in that it seems to require that each node have as many GPUs as there are MPI processes in total. This would be sensible if the code were written from the perspective that only one node would ever run the program. Specifically, MPI_COMM_WORLD is a "communicator", which represents a handle to a specific set of MPI ranks; the "WORLD" designator means all MPI ranks associated with your program. So
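
```
MPI_Comm_size(MPI_COMM_WORLD, &size);
```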

is returning (in size) the total number of ranks you have launched, which is 4.

So far none of my comments are unique or specific to CUDA; they represent general MPI knowledge. It's not my intention to spend a lot of time on this forum educating folks about how MPI works.

The next line:
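
```
cudaGetDeviceCount(&numDevices);
```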

returns the number of GPUs (in numDevices) on a specific node (whatever node that rank happens to be running on). In your situation this number is presumably 2, so, the way the code is written, it exits at that point. Hopefully it is clear that, as written, that snippet will only "not exit" if the number of GPUs on a given node is equal to or greater than the total number of ranks, not the number of ranks per node.

But ordinary/typical MPI usage would only require 1 GPU per rank on each node, so this is an unusual "check" in my view. A sketch of the more typical per-node check follows.
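
For reference, here is a minimal sketch of that per-node pattern, assuming an MPI-3 implementation (it uses MPI_Comm_split_type with MPI_COMM_TYPE_SHARED to build a per-node communicator); the local_comm, local_rank, and local_size names are mine, not anything from your code:

```
// Sketch: derive a per-node ("local") rank, then check and select GPUs per node
MPI_Comm local_comm;
int local_rank, local_size;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                    MPI_INFO_NULL, &local_comm);
MPI_Comm_rank(local_comm, &local_rank);  // this rank's index on its node
MPI_Comm_size(local_comm, &local_size);  // number of ranks on this node

int numDevices;
cudaGetDeviceCount(&numDevices);

// Compare GPUs on this node against ranks on this node, not ranks in total
if (numDevices < local_size) {
    if (local_rank == 0) {
        std::cerr << "Error: node has " << numDevices << " GPUs for "
                  << local_size << " local ranks" << std::endl;
    }
    MPI_Abort(MPI_COMM_WORLD, 1);
}

CUDA_RT_CALL(cudaSetDevice(local_rank % numDevices));
MPI_Comm_free(&local_comm);
```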

It’s further curious code because this line:
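
```
int device = rank % numDevices;
```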

is specifically designed to allow more ranks per node than there are GPUs per node (which is, again, legal and not necessarily incorrect).
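
To make that concrete, here is how the modulo mapping plays out, assuming (hypothetically) that ranks 0 and 1 land on server and ranks 2 and 3 on midiserver, with 2 GPUs visible on each node:

```
int device = rank % numDevices;  // numDevices == 2 on each node
// rank 0 (server)     -> device 0 on server
// rank 1 (server)     -> device 1 on server
// rank 2 (midiserver) -> device 0 on midiserver
// rank 3 (midiserver) -> device 1 on midiserver
// If all 4 ranks ran on a single node instead, ranks 0 and 2 would share
// device 0 and ranks 1 and 3 would share device 1 (oversubscribed, but legal).
```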

Since your code in its entirety does nothing useful (other than printing out some GPU spec data), it's hard to say what the proper intent is. I think I would probably just advise getting rid of the code I originally excerpted, i.e. this statement in its entirety, including its body:
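
```
if (numDevices < size) {
    if (rank == 0) {
        std::cerr << "Error: Number of CUDA devices (" << numDevices
                  << ") is less than MPI processes (" << size << ")" << std::endl;
    }
    MPI_Finalize();
    return 1;
}
```

If you still want a sanity check, the per-node version sketched above (using MPI_Comm_split_type) would be a more typical replacement; otherwise the rank % numDevices line already handles device selection on each node.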
