Invalid device when using Open MPI to run multiple processes on a machine with 8 GPUs

I am executing my code on an 8-GPU node with MPS enabled. I am trying to oversubscribe the GPUs by running 21 processes through MPI, like this:

mpirun -np 21 ./a.out

This run results in the following error:
call to cuDevicePrimaryCtxRetain returned error 101: Invalid device

When I run this on a machine with only a single GPU, no issues occur and it executes correctly (if inefficiently) through MPS.

I am certain that it has to do with how I am calling ACC_INIT:

  ACC_NUM = ACC_GET_NUM_DEVICES(ACC_DEVICE_NVIDIA)
  GPUNUM  = MOD(MYID,ACC_NUM)
  CALL ACC_SET_DEVICE_NUM(GPUNUM,ACC_DEVICE_NVIDIA)
  CALL ACC_INIT(ACC_DEVICE_NVIDIA)
  ACC_DEV = ACC_GET_DEVICE_NUM(ACC_DEVICE_NVIDIA)
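
To make the intended mapping concrete, here is a minimal standalone sketch of the same MOD formula. It does not call MPI or OpenACC. The rank count (21) and device count (8) are hard-coded assumptions taken from this post, and it assumes MYID is a 0-based MPI rank as returned by MPI_COMM_RANK. It only tabulates which device number each rank would request, so it does not reproduce or diagnose the error itself.

```fortran
program rank_to_device
  implicit none
  ! Assumed values from the post: 21 MPI ranks on a node with 8 GPUs.
  ! In the real code these would come from MPI_COMM_RANK and
  ! ACC_GET_NUM_DEVICES(ACC_DEVICE_NVIDIA).
  integer, parameter :: nranks  = 21
  integer, parameter :: acc_num = 8
  integer :: myid, gpunum
  integer :: ranks_per_gpu(0:acc_num-1)

  ranks_per_gpu = 0

  ! Walk over every rank and count how many land on each device number.
  do myid = 0, nranks - 1
    gpunum = mod(myid, acc_num)   ! same formula as in the snippet above
    ranks_per_gpu(gpunum) = ranks_per_gpu(gpunum) + 1
  end do

  ! Report the number of ranks that request each device number.
  do gpunum = 0, acc_num - 1
    print '(a,i0,a,i0,a)', 'device ', gpunum, ': ', ranks_per_gpu(gpunum), ' ranks'
  end do
end program rank_to_device
```

With these numbers the formula only ever produces device numbers 0 through 7: device numbers 0 to 4 are each requested by three ranks and device numbers 5 to 7 by two. So the requested numbers stay within the count reported by ACC_GET_NUM_DEVICES; whether the runtime accepts all of them when MPS is active is a separate question.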

Any help would be appreciated.

Is this a CUDA programming question? It doesn’t look like it. If you are using PGI OpenACC, you might get more expert help by posting your question on the PGI forum: http://www.pgroup.com/userforum/index.php

There is also an OpenACC section on this board: https://devtalk.nvidia.com/default/board/56/openacc-toolkit/