Proper way to call CUDA function within MPI code

Hi,

To test whether MPI can be used to share a workload between MIG partitions, I wrote a simple MPI-CUDA example. It is just a further simplified version of the simpleMPI sample provided by NVIDIA.

  • wrapper.h
void kernel_launch(float *A, float *B, float *C, int N);
  • kernel.cu
#include <cuda.h>
#include <cuda_runtime.h>
#include <math.h>
#define BLOCK_SIZE 256

__global__ void vecAdd(float *A, float *B, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if ( i < N ) {
        C[i] = A[i] + B[i];
    }
}

void kernel_launch(float *A, float *B, float *C, int N) {
    float *A_d, *B_d, *C_d;

    /* Allocate device buffers */
    cudaMalloc((void**)&A_d, N * sizeof(float));
    cudaMalloc((void**)&B_d, N * sizeof(float));
    cudaMalloc((void**)&C_d, N * sizeof(float));

    /* Copy the inputs to the device */
    cudaMemcpy(A_d, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B, N * sizeof(float), cudaMemcpyHostToDevice);

    vecAdd<<<dim3((unsigned int)ceil((float)N/(float)BLOCK_SIZE)), dim3(BLOCK_SIZE)>>>(A_d, B_d, C_d, N);
    cudaDeviceSynchronize();

    /* Copy the result back to the host */
    cudaMemcpy(C, C_d, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
  • main.c
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>
#include "wrapper.h"

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one device per rank; assumes the number of visible devices matches the number of ranks */
    cudaSetDevice(rank % size);

    int N = 10000; 
    
    float *A = (float*) malloc(N * sizeof(float));
    float *B = (float*) malloc(N * sizeof(float));
    float *C = (float*) malloc(N * sizeof(float));

    for (int i = 0; i < N; i++) {
        A[i] = i; 
        B[i] = i * 2; 
    }

    kernel_launch(A, B, C, N);

    free(A);
    free(B);
    free(C);

    MPI_Finalize();

    return 0;
}

Despite having declared a prototype for kernel_launch() in a header file and included said file in main.c, I got the following error when linking object files:

$ mpicc -c main.c 
$ nvcc -c kernel.cu
$ mpicc kernel.o main.o -L/apps/cuda/11.4/lib64 -lcudart -lstdc++
main.o: In function `main':
main.c:(.text+0x10b): undefined reference to `kernel_launch'
collect2: error: ld returned 1 exit status

In case my eyes deceived me:

$ grep 'kernel_launch' * 
kernel.cu:void kernel_launch(float *A, float *B, float *C, int N) {
main.c:    kernel_launch(A, B, C, N); 
wrapper.h:void kernel_launch(float *A, float *B, float *C, int N);

From examining similar samples, the basic ideas are:

  • Separating the CUDA code from the MPI C code (kernel.cu vs. main.c)
  • Keeping the CUDA kernel and its wrapper in the same file (kernel.cu)
  • Using a header file for the wrapper prototype (wrapper.h)

I am at a loss as to what causes this issue. Any insight is much appreciated.

Thanks.

You'll need either to learn how to mix C++ and C style linkage, or else convert your main.c routine to a C++ routine and use e.g. mpicxx instead of mpicc to compile it.

Your C++ compilation with nvcc creates a mangled name. That mangled name doesn't look exactly like kernel_launch, and so the C-style name that the mpicc compilation is looking for can't be found.

You could try using extern "C". There are numerous questions on public forums about how to do this, and this issue is not unique or specific to CUDA.
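For example, you can see the mismatch directly by inspecting the object files with nm (assuming GNU binutils; the exact mangled name depends on the compiler, but the output should look along these lines):

$ nm kernel.o | grep kernel_launch
0000000000000000 T _Z13kernel_launchPfS_S_i
$ nm main.o | grep kernel_launch
                 U kernel_launch

The nvcc-compiled object defines the C++-mangled symbol, while the mpicc-compiled object references the plain C name.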

Hi Robert,

Thanks for pointing me to right direction.
I have solved the issue following your suggestions:

  • Using extern "C" in the header file:
    extern "C" { 
        void kernel_launch(float *A, float *B, float *C, int N);
    } 
    
  • Renaming main.c to main.cpp (an alternative that keeps it as plain C is sketched below)
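As a side note, a common variant (just a sketch, not part of the fix above) is to guard the extern "C" block so the same header can be included from both C and C++ translation units; that way main.c could stay as plain C and still be compiled with mpicc:

/* wrapper.h -- guarded variant (sketch) */
#ifdef __cplusplus
extern "C" {
#endif

void kernel_launch(float *A, float *B, float *C, int N);

#ifdef __cplusplus
}
#endif

For this to link, kernel.cu (which nvcc compiles as C++) also has to see the extern "C" declaration, e.g. by adding #include "wrapper.h" at the top of kernel.cu, so that the definition of kernel_launch itself gets C linkage.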

I hope you don't mind me continuing here instead of opening a new thread.
For testing purposes, I let each MPI rank print out the UUID of its device.

  • When requesting two ‘GPUs’, either physical or MIG, via Slurm:
    $ echo $CUDA_VISIBLE_DEVICES
    0,1 
    
  • With two physical GPUs:
    $ mpirun -np 2 ./hello.x  
    Hello from rank 1 on GPU 1 (9529b44b5fd6f4f667495b37afdf5ea8) 
    Using 2 MPI Ranks and 2 GPUs
    Hello from rank 0 on GPU 0 (214aa97fecba45524bd75578e3a86bcd) 
    
  • With two MIG partitions:
    $ nvidia-smi -L 
    GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-978b5d00-eeec-7739-e525-1c8035af8318)
    MIG 1g.10gb     Device  0: (UUID: MIG-932fdc04-fd69-578e-95a3-fb1d63095dc9)
    MIG 1g.10gb     Device  1: (UUID: MIG-491c77a9-9a7d-5873-9f06-a4008832a62b)
    
    $ mpirun -np 2 ./hello.x  
    Hello from rank 1 on GPU 1 (00000000000000000000000000000000) 
    Using 2 MPI Ranks and 1 GPUs
    Hello from rank 0 on GPU 0 (978b5d00eeec7739e5251c8035af8318) 
    

So at the software level there are two devices, but in the case of MIG:

  • cudaGetDeviceCount returns 1.
  • Rank-based device ID assignment such as dev_id = rank % size is possible, but
    • cudaGetDeviceProperties() only reports the parent GPU, i.e. GPU-978b5d00eeec7739e5251c8035af8318
    • the second dev_id yields an essentially null (all-zero) UUID.
  • When running a real workload such as Quantum ESPRESSO on two MIG instances, only one MIG instance is used.

So if I understand correctly:

  • Each process can utilize only one MIG instance.
  • In the case of MPI, a rank (e.g. rank 0) sees only one MIG instance and cannot coordinate work across the other MIG instances.

Is the understanding above correct?

The issue with MIG is that the CUDA runtime only exposes a single MIG instance per process, regardless of how many MIG instances you have or how many you enumerate in CUDA_VISIBLE_DEVICES. What you need to do to have a separate MIG slice per MPI process is to carefully control the setting of CUDA_VISIBLE_DEVICES per MPI rank/process. It needs to be different for each MPI rank, and you need to use that env var to expose one and only one unique MIG slice per MPI rank.

This one MIG slice per process is a MIG peculiarity. Please read the MIG user guide.
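One way to do this per-rank assignment is a small launcher script that picks a unique MIG UUID for each local rank. The sketch below assumes Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK); under srun you could use SLURM_LOCALID instead, and the script name mig_wrap.sh is just a placeholder:

#!/bin/bash
# mig_wrap.sh (hypothetical name) -- per-rank MIG selection, a sketch
# Collect the MIG UUIDs reported by nvidia-smi -L
MIGS=( $(nvidia-smi -L | grep -oE 'MIG-[0-9a-f-]+') )
# Local rank on this node; assumes Open MPI (use SLURM_LOCALID under srun)
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
# Expose exactly one MIG slice to this process
export CUDA_VISIBLE_DEVICES=${MIGS[$LOCAL_RANK]}
exec "$@"

It would then be launched as, e.g., mpirun -np 2 ./mig_wrap.sh ./hello.x, so that every rank sees its own MIG slice as device 0.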

Thank you for the clarification.

Manually assigning CUDA_VISIBLE_DEVICES to each rank gives the expected result.
