MPI does not detect 2 GPUs: MPI program gives "no CUDA-capable device is detected"

I want to make use of the two GPUs on my motherboard and am using MPICH2 to do this.

The process is started with "mpiexec -n 2 fdtd.exe"; the error checking then reports "no CUDA-capable device is detected" and the result is garbage.

The simple code below runs correctly within the VC++ 9.0 environment.

Using Windows 7, VC 9.0, SDK 3.2, Toolkit 3.2, driver 270.61, and MPICH2.

How can I make MPI detect the GPUs? Test code below:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/types.h>
#include <time.h>
#include <assert.h>
#include <cuda.h>
#include "mpi.h"

// includes, project
//#include <cutil_inline.h>

__global__ static void exeyloop(float* ex)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // field-update body omitted in the original post
}

__global__ static void ehehloop(float* ex)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // field-update body omitted in the original post
}

void assignEvaluateFdtd1()
{
    int j;

    int blockt;            // number of threads per block
    int gridt;             // number of blocks in the grid
    float *ex_a;
    float *ex_d;
    cudaError_t cudareturn;
    int gpudevice = 0;     // device index to run on

    ex_a = (float *)malloc(100*sizeof(float));
    for (j = 0; j < 100; j++)
        ex_a[j] = 0.0f;    // initialize host array before copying to the device

    cudareturn = cudaSetDevice(gpudevice);   // the call itself was missing in the original post
    printf("Cuda set device %d\n", cudareturn);
    if (cudareturn != cudaSuccess)
        printf("\n %s \n", cudaGetErrorString(cudareturn));
    if (cudareturn == cudaErrorInvalidDevice)
        printf("\n cudaSetDevice returned cudaErrorInvalidDevice");

    cudareturn = cudaMalloc((void **) &ex_d, 100*sizeof(float));
    printf("Cuda Malloc to device %d\n", cudareturn);

    cudareturn = cudaMemcpy(ex_d, ex_a, 100*sizeof(float), cudaMemcpyHostToDevice);
    printf("Cuda cudaMemcpy device %d\n", cudareturn);

    // allocate array dimensions: 100 threads in one block cover the 100-element array
    blockt = 100;
    gridt  = 1;
    dim3 dimBlock(blockt, 1);
    dim3 dimGrid(gridt, 1);

    exeyloop <<< dimGrid, dimBlock >>> (ex_d);
    cudareturn = cudaMemcpy(ex_a, ex_d, 100*sizeof(float), cudaMemcpyDeviceToHost);
    printf("Cuda cudaMemcpy back device %d\n", cudareturn);
    printf("\n exey \n");
    for (j = 0; j < 100; j++)
        printf("\n %f ", ex_a[j]);

    ehehloop <<< dimGrid, dimBlock >>> (ex_d);
    cudaMemcpy(ex_a, ex_d, 100*sizeof(float), cudaMemcpyDeviceToHost);

    printf("\n eheh \n");
    for (j = 0; j < 100; j++)
        printf("\n %f ", ex_a[j]);

    cudaFree(ex_d);
    free(ex_a);
}  // end of assignEvaluateFdtd1


int main(int argc, char **argv)
{
    int myid, numprocs;
    char processor_name[100];
    int namelen;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Process %d of %d on %s\n", myid, numprocs, processor_name);

    // the total elapsed time in ms would be measured around this call
    assignEvaluateFdtd1();

    MPI_Finalize();
    return 0;
}


This is apparently a known issue with Windows 7.

I posted on the MPICH2 forum too and got the following answer:

"Use the -localonly flag with mpiexec on the command line!"

This sorted out the problem. So: to recap, on Windows 7 use

mpiexec -localonly -n 2 progname

and NOT the following:

mpiexec -n 2 progname


I know this is an old thread, but can you please explain how to make this change using MPICH2?

I opened MPIEXEC and, under the advanced options, tried adding the -localonly flag, but the error message did not change (the same example you describe, 1 by 1).

Further info, and I am not sure this is the correct thread, as this is NOT related to CUDA and is more of an MPICH issue:
running from the console with the -localonly flag resolved the problem, BUT how do I use the MPIEXEC wrapper instead of the console?

Thank you,