How to run these sample multi-gpu programs

Hi, I have the following sample multi-GPU programs https://github.com/NVIDIA/multi-gpu-programming-models, specifically the mpi and mpi_overlap programs, and I’d like to know how to compile and run them on the following two nodes:

  1. a node with two compute_35 GPUs
  2. a node with one compute_52 GPU

Information is given at that link, particularly in the requirements, building, and run sections. For example, to get started with building the mpi variant, you would git clone the repo, then cd into multi-gpu-programming-models/mpi. At that point you’ll want to change the gencode flags to match your GPU architecture(s), then make. You’ll need to make sure you have satisfied the requirements for the build to be successful.
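A minimal sketch of those steps (the sm_35/sm_52 values below match the GPUs you listed; the exact Makefile variable that holds the gencode flags may differ from what the comment suggests):

git clone https://github.com/NVIDIA/multi-gpu-programming-models.git
cd multi-gpu-programming-models/mpi
# edit the Makefile so the gencode flags cover both of your architectures, e.g.
#   -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52
make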

Hi Robert, I see the run sections in the link. They don’t seem to say how to specify which nodes and which GPUs to run the programs on, though. Please tell me if I missed something. Thanks.

As a first step, my suggestion would be that you learn how to distribute an MPI problem to those nodes (ignoring GPU/CUDA). Once you do that, you will know how to specify the nodes to run your codes on. It has basically nothing to do with CUDA, and you can find lots of resources on the web to learn how to use MPI.

Typically with MPI, each node has a name, and you provide a list of those names (perhaps via a hostfile) to your mpirun command to identify how to distribute MPI ranks to nodes. Again, my suggestion is to use available resources to learn how to use MPI.
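For example, with Open MPI you could use a hostfile along these lines (the node names and the program name here are placeholders):

# hostfile listing each node and how many MPI ranks (slots) it may run
cat > hosts.txt << 'EOF'
nodeA slots=2
nodeB slots=1
EOF
mpirun --hostfile hosts.txt -np 3 ./your_mpi_program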

For the two examples you have mentioned, that is sufficient for the multi-GPU, single-node case. I haven’t checked, but it may also work in the multi-node case. Anyway, that is how those examples are set up.

In the general case, the remaining issue is how to associate MPI ranks to GPUs. That is canonically done via one of two methods:

  1. The code run by each MPI rank determines its logical rank number, then chooses a GPU based on that. An example is here.

  2. The MPI rank is designed to use only a single GPU, and the GPU it uses is determined by appropriate use of CUDA_VISIBLE_DEVICES in the launch script. Here is an example; a minimal sketch of this approach also follows this list.
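As a sketch of method 2, assuming Open MPI (which sets OMPI_COMM_WORLD_LOCAL_RANK for each process; other MPI implementations use a different variable, and the program name is a placeholder):

# wrapper that pins each local MPI rank to one GPU before launching the program
cat > bind_gpu.sh << 'EOF'
#!/bin/bash
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
exec "$@"
EOF
chmod +x bind_gpu.sh
mpirun -np 2 ./bind_gpu.sh ./your_mpi_program

With this approach each rank sees exactly one GPU, which it addresses as device 0, so the rank code itself needs no device-selection logic.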

This blog and part 2 may also be of interest. Here are some additional GTC resources: 1 2

Also, there are CUDA sample codes that cover multi-GPU: simpleMultiGPU, simpleMPI

I have the following code which I compiled on server (192.168.1.3) and midiserver (192.168.1.4) with:

[vorlket@server cudaprac]$ nvcc -o cuda_code_35.o -c multi_node_multi_gpu.cpp -arch=sm_35
[vorlket@server cudaprac]$ mpic++ -o multi_node_multi_gpu_35 cuda_code_35.o -I/opt/cuda/targets/x86_64-linux/include -L/opt/cuda/targets/x86_64-linux/lib -l:libcudart.so
[vorlket@server cudaprac]$ cp multi_node_multi_gpu_35 ~/sharedfolder/multi_node_multi_gpu_35

multi_node_multi_gpu.cpp:

#include <iostream>
#include <mpi.h>
#include <cuda_runtime.h>

#define CUDA_RT_CALL(call) \
{ \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        std::cerr << "CUDA Runtime error at " << __FILE__ << ":" << __LINE__ << ": " \
                  << cudaGetErrorString(err) << std::endl; \
        MPI_Abort(MPI_COMM_WORLD, 1); \
    } \
}

int main(int argc, char *argv[]) {
    int rank, size;
    int numDevices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    CUDA_RT_CALL(cudaGetDeviceCount(&numDevices));

    // Ensure there are enough GPUs for all MPI processes
    // (note: cudaGetDeviceCount reports only the GPUs visible on this node)
    if (numDevices < size) {
        if (rank == 0) {
            std::cerr << "Error: Number of CUDA devices (" << numDevices
                      << ") is less than MPI processes (" << size << ")" << std::endl;
        }
        MPI_Finalize();
        return 1;
    }

    // Determine which GPU to use based on MPI rank
    int device = rank % numDevices;
    CUDA_RT_CALL(cudaSetDevice(device));

    cudaDeviceProp prop;
    CUDA_RT_CALL(cudaGetDeviceProperties(&prop, device));

    // Gather information from all nodes to node 0
    char node_name[MPI_MAX_PROCESSOR_NAME];
    int node_name_len;
    MPI_Get_processor_name(node_name, &node_name_len);

    // Print CPU and GPU information from each MPI rank
    std::cout << "Node " << rank << " (on " << node_name << "), GPU " << device << ": " << prop.name << std::endl;
    std::cout << "  Compute capability: " << prop.major << "." << prop.minor << std::endl;
    std::cout << "  Total global memory: " << prop.totalGlobalMem << " bytes" << std::endl;
    std::cout << "  Memory clock rate: " << prop.memoryClockRate << " kHz" << std::endl;
    std::cout << "  Memory bus width: " << prop.memoryBusWidth << " bits" << std::endl;
    std::cout << "  Peak memory bandwidth (GB/s): " << 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6 << std::endl;

    MPI_Finalize();
    return 0;
}

When I run it with [vorlket@server cudaprac]$ mpirun -host server:2,midiserver:2 -np 4 /home/vorlket/sharedfolder/multi_node_multi_gpu_35, I get the following:

[server:01853] No HIP capabale device found. Disabling component.
[server:01854] No HIP capabale device found. Disabling component.
/home/vorlket/sharedfolder/multi_node_multi_gpu_35: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
/home/vorlket/sharedfolder/multi_node_multi_gpu_35: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
prterun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

   Process name: [prterun-server-1849@1,2] Exit code:    127
--------------------------------------------------------------------------

Compiling it individually on each server for its respective architecture and running the resulting binary there works.

I would appreciate it if someone could guide me on how to compile the program on server, put it in the sharedfolder, and run it on both server (CUDA 11.1 installed) and midiserver (CUDA 12 installed) at the same time. Also, please let me know if this is impossible and I need to install CUDA 11.1 on both servers.

Thanks.

Each node will certainly require a proper GPU driver install; that is unavoidable if you want to use the GPUs on each node.

Certainly installing the full CUDA toolkit (on each node) would be one way to solve this problem.

Otherwise you would need to make libcudart available on each node somehow. You could just copy it there and make sure it is on the library search path. You could also set up e.g. an NFS share that has everything needed to run the code, and make that share available on each node in your cluster. The libcudart from your build on the CUDA 11.1 machine is usable on the CUDA 12 machine, but not vice versa, so build the code on the machine with the lower CUDA toolkit version.
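As a sketch, using the paths from your build commands and the library name from your error message (adjust them to your actual install; -x is how Open MPI exports an environment variable to the remote ranks):

cp /opt/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 ~/sharedfolder/
mpirun -host server:2,midiserver:2 -np 4 \
       -x LD_LIBRARY_PATH=/home/vorlket/sharedfolder:$LD_LIBRARY_PATH \
       /home/vorlket/sharedfolder/multi_node_multi_gpu_35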

Running/managing clusters (e.g. setting up an NFS share) is documented in many places on the web, and not something I would try to cover on this forum.

I think the usual advice for any Beowulf-style cluster is that the cluster administrator should make sure the software install on each node is identical. Having some nodes with CUDA 11 and some with CUDA 12 is going to lead to problems like this. Even if you sort this one out, say by distributing the needed CUDA 11 libcudart manually, someday you might try to run a code that requires some other library (a cuFFT code, for example), and you will be revisiting this. So as a matter of sanity, or efficient use of your time, you might want to make sure all your nodes have the same software install.
