NVSHMEM with MPS and numGpus + 1 MPI processes

I am running a slightly modified version of the NVIDIA NVSHMEM sample code via MPI on a server with 4 GPUs and MPS (Multi-Process Service) enabled. I need 5 MPI processes running in total, i.e., one more process than there are GPUs.

With 5 processes, the first call to nvshmem_malloc hangs indefinitely.
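For context, my initialization path roughly follows the MPI-bootstrap pattern from the NVSHMEM samples. This is a sketch rather than my exact code; in particular, selecting the device with `mype_node % num_gpus` (so that two PEs share one GPU under MPS) is the part I have changed from the stock sample:

```cuda
#include <mpi.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    /* Bootstrap NVSHMEM on top of the MPI communicator. */
    nvshmemx_init_attr_t attr;
    MPI_Comm comm = MPI_COMM_WORLD;
    attr.mpi_comm = &comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    /* Under MPS several PEs share a GPU, so the device index wraps
       around; with 5 PEs on 4 GPUs, PEs 0 and 4 land on device 0. */
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node % num_gpus);

    /* nvshmem_malloc is collective over all PEs; this is the call
       that hangs in the 5-process run. */
    void *buf = nvshmem_malloc(1024);

    nvshmem_free(buf);
    nvshmem_finalize();
    MPI_Finalize();
    return 0;
}
```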

If I start 8 processes instead, execution gets further, but it hangs after the kernel execution.

We only use NVSHMEM within a single node at a time, over PCIe / NVLink.

I am looking for pointers on resolving what seems to be a race condition or an incorrect ordering of a collective call.