Invalid Device when using Open MPI to run multiple processes

I am running my code on an 8-GPU node with MPS enabled. I am trying to overload the GPUs by running 21 processes through MPI like this:

mpirun -np 21 ./a.out

This run results in the following error:
call to cuDevicePrimaryCtxRetain returned error 101: Invalid device

When I run this on a machine with only a single GPU, no issues occur and the code executes correctly (if inefficiently) through MPS.

I am certain that it has to do with how I am calling ACC_INIT


Any help would be appreciated.

If you have 8 GPUs on your one platform and wish to use them all simultaneously, the usual method is to run an OpenMP parallel section with 8 threads on the CPU, where each thread selects a different GPU and then runs the GPU code on its assigned device. You can sync all the work at the end of the OpenMP section.
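
As a rough illustration of that pattern, here is a minimal C sketch. The function name scale_on_all_gpus and the trivial kernel are made up for the example, and it assumes NVIDIA devices (acc_device_nvidia). The OpenMP and OpenACC calls are guarded, so it still builds and runs on the host alone when compiled without those flags.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: fills data[i] = 2*i for i in [0, n), splitting the
   range across host threads. Each thread selects its own GPU before it
   offloads. num_gpus should be the number of devices actually present
   (e.g. the value returned by acc_get_num_devices).
   Returns the number of chunks that were processed. */
int scale_on_all_gpus(float *data, int n, int num_gpus)
{
    assert(n >= 0 && num_gpus > 0);
    int chunks_done = 0;

    #pragma omp parallel num_threads(num_gpus) reduction(+:chunks_done)
    {
        int tid = 0, nthreads = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();  /* may be fewer than requested */
#endif
#ifdef _OPENACC
        /* One device per host thread; the modulo keeps the index in range. */
        acc_set_device_num(tid % num_gpus, acc_device_nvidia);
#endif
        int chunk = (n + nthreads - 1) / nthreads;  /* ceiling division */
        int lo = tid * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;

        if (lo < hi) {
            float *part = data + lo;
            int len = hi - lo;
            #pragma acc parallel loop copy(part[0:len])
            for (int i = 0; i < len; ++i)
                part[i] = 2.0f * (float)(lo + i);
            chunks_done += 1;
        }
    }   /* implicit barrier: all threads (and their GPU work) finish here */

    return chunks_done;
}
```

Calling scale_on_all_gpus(buf, 1000, 8) would run one chunk of 125 elements per thread when 8 threads are available. The end of the parallel region is the sync point mentioned above.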

will tell you what the compilers can see (8 GPUs?), to make sure
the compilers can access them.

A multi-process MPI program has to know which GPUs are available,
or it may end up just waiting for processes to end.
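
One possible cause of the "Invalid device" error (a guess, since the device-selection code is not shown in the post) is that each process requests a device number equal to its rank, so any rank from 8 upward asks for a GPU that does not exist. A common workaround is to wrap the rank onto the available devices. The sketch below assumes NVIDIA devices and Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable. The helper names are hypothetical, and the OpenACC calls are guarded so the mapping logic can be checked with a plain C compiler.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Round-robin mapping: with 8 devices, local ranks 0..7 get devices 0..7,
   rank 8 wraps back to device 0, and so on. */
int device_for_rank(int local_rank, int num_devices)
{
    assert(local_rank >= 0 && num_devices > 0);
    return local_rank % num_devices;
}

/* Node-local rank as exported by Open MPI's mpirun; falls back to 0 when
   the variable is not set (e.g. when run without mpirun). */
int local_rank_from_env(void)
{
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    return s ? atoi(s) : 0;
}

/* Selects a GPU for the calling process. Returns the device number chosen,
   or -1 if no device is visible (or the build has no OpenACC support). */
int bind_rank_to_gpu(int local_rank)
{
    int num_devices = 0;
#ifdef _OPENACC
    num_devices = acc_get_num_devices(acc_device_nvidia);
#endif
    if (num_devices <= 0)
        return -1;

    int dev = device_for_rank(local_rank, num_devices);
#ifdef _OPENACC
    /* Do this before the first compute region so that the process does
       not end up on a default device instead. */
    acc_set_device_num(dev, acc_device_nvidia);
#endif
    return dev;
}
```

Each process would call bind_rank_to_gpu(local_rank_from_env()) once at startup. With 21 processes on 8 GPUs, devices 0 to 4 would each serve three processes and devices 5 to 7 would each serve two.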

The GPUs do not do multi-tasking; they only run one job at a time. I am not sure that overloading a single platform with processes that each access individual GPUs will be successful.