If one of the MPI processes finds no unused device available, its subsequent CUDA runtime API calls fail, which makes the whole MPI job fail. Is there any way to detect a device query failure? Thanks for any suggestions!
Did you try something like this:
cudaError_t err = cudaSetDevice(mpi_rank);
if (err != cudaSuccess) {
    /* no usable device for this rank; report why */
    printf("No CUDA device available: %s\n", cudaGetErrorString(err));
} else {
    /* CUDA calculation goes here */
}
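A slightly fuller sketch of the same idea, in case it helps. It checks the device count first, so a rank beyond the number of devices is reported as "no device left" rather than as a generic failure. The helper names device_for_rank and bind_rank_to_device are made up for illustration, and the rank-to-device mapping assumes all ranks run on one machine.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper: map a rank to a device index, or -1 if there
   are more ranks than devices. Pure logic, no CUDA calls.
   Assumes every rank runs on the same machine. */
static int device_for_rank(int rank, int device_count)
{
    if (rank < 0 || rank >= device_count)
        return -1;
    return rank;
}

/* Hypothetical helper: bind the calling process to its device.
   Returns the device index on success, -1 if no device is available. */
static int bind_rank_to_device(int rank)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return -1;
    }

    int dev = device_for_rank(rank, count);
    if (dev < 0) {
        fprintf(stderr, "rank %d: %d device(s) present, none left for this rank\n",
                rank, count);
        return -1;
    }

    err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaSetDevice(%d) failed: %s\n",
                rank, dev, cudaGetErrorString(err));
        return -1;
    }
    return dev;
}
```

Each rank would call bind_rank_to_device(mpi_rank) once at startup and skip (or abort cleanly) if it gets -1 back.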
Hi fcs,
Thank you for your reply! I think your suggestion may be one possible solution.
My situation is this: the system has 4 GPU devices. If I submit an MPI job with 5 processes, the 4 devices are automatically attached to the first 4 processes, and the 5th has no way to detect whether a device is available for it until it reaches its first CUDA call (such as cudaMalloc), which will certainly fail. In that case the failure does not tell me the cause (device unavailable or memory allocation error). It would be nice to have a CUDA call that detects whether a device is available before making any other CUDA call.
Your suggestion may work for me, but I would need to keep track of which devices are still vacant.
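I think exclusive compute mode is what you want. It is set with nvidia-smi rather than with cudaSetDeviceFlags; once the devices are in exclusive mode, the 5th process will fail as soon as it tries to create a context on a device (at cudaSetDevice or at the first runtime call after it, depending on the CUDA version).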
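With the devices in exclusive mode, the "is a device available?" check asked about above could look roughly like the sketch below. It assumes the process has not called cudaSetDevice, so the runtime picks a free device itself, and it uses the common cudaFree(0) idiom to force context creation before any real work. The helper name acquire_device is made up for illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper: try to obtain a device before doing any real work.
   On success returns cudaSuccess and stores the device index in *dev_out;
   otherwise returns the error that prevented context creation. */
static cudaError_t acquire_device(int *dev_out)
{
    /* cudaFree(0) frees nothing, but forces the runtime to create a
       context, so "no device" shows up here and not in a later cudaMalloc. */
    cudaError_t err = cudaFree(0);

    if (err == cudaErrorDevicesUnavailable || err == cudaErrorNoDevice) {
        fprintf(stderr, "no free CUDA device: %s\n", cudaGetErrorString(err));
        return err;
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA initialization failed: %s\n",
                cudaGetErrorString(err));
        return err;
    }

    /* Report which device the runtime attached this process to. */
    return cudaGetDevice(dev_out);
}
```

Calling this once right after MPI_Init lets the 5th process tell "no device left" apart from a later memory allocation error.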