About multi-GPU control

Hello !

I’m currently studying how to control multiple GPUs with CUDA.

But I have no idea how to control them.

I referenced this site (https://github.com/zchee/cuda-sample/blob/master/0_Simple/simpleMultiGPU/simpleMultiGPU.cu), but I couldn’t follow how that code is meant to be used.

Still, that code inspired me, so I used CUDA + MPI to set the device ID from the MPI rank and run on all devices at the same time.

But my code uses only one GPU… :(

How can I control the GPUs in parallel?

Here is my code

#include <stdio.h>
#include <cuda_runtime.h>
#include <curand.h>
#include <mpi.h>

#define XBLOCKSIZE 10
#define XGRIDSIZE 10

int main(int argc, char** argv)  {
        int mpi_error, rank, numtasks;
        mpi_error = MPI_Init(&argc, &argv);

        if(mpi_error != MPI_SUCCESS) {
                printf("MPI error has occurred.\n");
                return 0;
        }

        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // rank 0 prints the problem size
        if(rank == 0)
                printf("<<<%d,%d>>>\n", XGRIDSIZE, XBLOCKSIZE);

        int num_of_gpus;
        cudaGetDeviceCount(&num_of_gpus);

        printf("number of gpus : %d\n", num_of_gpus);
        printf("My rank : %d\n", rank);

        if(numtasks > num_of_gpus){
                printf("Too many processes for the number of gpus\n");
                return 1;
        }

        int devID = rank;

        cudaError_t error;
        cudaDeviceProp deviceProp;
        error = cudaGetDevice(&devID);

        if(error != cudaSuccess){
                printf("cudaGetDevice returned error %s\n", cudaGetErrorString(error));
        }

        error = cudaGetDeviceProperties(&deviceProp, devID);

        if(error != cudaSuccess){
                printf("cudaGetDeviceProperties returned error %s\n", cudaGetErrorString(error));
        }

        printf("GPU BUS ID : %d\n", deviceProp.pciBusID);
        printf("GPU Device ID : %d\n", deviceProp.pciDeviceID);

        MPI_Finalize();
        return 0;
}

Results :
number of gpus : 2
My rank : 0
number of gpus : 2
My rank : 1
GPU Device ID : 0
GPU Device ID : 0

I think you’re making it more complicated than it needs to be. You can easily use OpenMP.

  1. Set the number of OpenMP threads to the number of GPUs you have or want to use.
  2. Create an OpenMP parallel region.
  3. Inside the region, get the OpenMP thread id.
  4. In each OpenMP thread, set the device. This is very important because it makes the thread aware of the correct CUDA context.
  5. You can use a struct to pass data to the different threads.
omp_set_num_threads( numDevices );
#pragma omp parallel
{
    int ompId = omp_get_thread_num( );

    // We must set the device in each thread
    // so the correct CUDA context is visible
    checkCudaErrors( cudaSetDevice( ompId ) );
    checkCudaErrors( cudaStreamCreate( &streams[ompId] ) );

    // Each thread works on its own slice of the data
    int offset { sizeData * ompId };
    void *args[] { &offset, &params[ompId].data };

    checkCudaErrors( cudaLaunchKernel( reinterpret_cast<void *>( &doMath ),
                                       blocksPerGrid, threadPerBlock,
                                       args, 0, streams[ompId] ) );
}


Here’s an additional link on OpenMP + CUDA https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
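To show the steps above end to end, here is a minimal self-contained sketch, one OpenMP thread per GPU. The kernel `doMath`, the problem size, and the block size are all illustrative assumptions, not from your code; error checking is omitted for brevity.

#include <cstdio>
#include <omp.h>
#include <cuda_runtime.h>

__global__ void doMath( float *data, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        data[i] *= 2.0f;
}

int main( )
{
    int numDevices = 0;
    cudaGetDeviceCount( &numDevices );

    // One OpenMP thread per device
    omp_set_num_threads( numDevices );

#pragma omp parallel
    {
        int ompId = omp_get_thread_num( );

        // Bind this thread to its own device so all
        // subsequent CUDA calls use the right context.
        cudaSetDevice( ompId );

        cudaStream_t stream;
        cudaStreamCreate( &stream );

        int n = 1 << 20;
        float *d_data;
        cudaMalloc( &d_data, n * sizeof( float ) );

        void *args[] = { &d_data, &n };
        cudaLaunchKernel( reinterpret_cast<void *>( &doMath ),
                          dim3( ( n + 255 ) / 256 ), dim3( 256 ),
                          args, 0, stream );

        cudaStreamSynchronize( stream );
        cudaFree( d_data );
    }
    return 0;
}

Each thread launches work into its own stream on its own device, so the GPUs run concurrently.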

Your MPI code is broken because you never set the device based on the rank.

All threads/processes start out with an assumed device ID of zero. In order to change this, it is mandatory to make a call to cudaSetDevice().

Your code never does that.

You can fix it by making this change:

cudaDeviceProp deviceProp;
error = cudaSetDevice(devID);  // add this line
error = cudaGetDevice(&devID);
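Putting that fix in context, a minimal sketch of the corrected rank-to-device mapping could look like this (the `rank % num_of_gpus` mapping is an assumption for the common one-process-per-GPU setup):

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main( int argc, char **argv )
{
    MPI_Init( &argc, &argv );

    int rank = 0;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    int num_of_gpus = 0;
    cudaGetDeviceCount( &num_of_gpus );

    // The critical line: without it, every rank stays on device 0.
    cudaSetDevice( rank % num_of_gpus );

    int devID = -1;
    cudaGetDevice( &devID );
    printf( "rank %d -> device %d\n", rank, devID );

    MPI_Finalize( );
    return 0;
}

With two ranks and two GPUs, each rank should now report a different device.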

Thank you for your quick reply!!

I solved it :)