I am running a slightly modified version of the NVIDIA NVSHMEM sample code on a server with 4 GPUs and MPS (Multi-Process Service), launched with MPI. I need 5 processes running in total, so at least two of them have to share a GPU.
With 5 processes, the GPU hangs indefinitely in the nvshmem_malloc call during initialization.
If I start 8 processes instead, the program gets further along, but it hangs after the kernel execution.
We only use NVSHMEM on one node at a time, over PCIe / NVLink.
I am looking for pointers on how to resolve what seems to be a race condition or a wrong execution order of a collective call.
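
To make the process-to-GPU layout concrete, here is a small host-side sketch of how the processes would be spread over the 4 GPUs. It assumes a simple round-robin assignment by node-local rank; that mapping and the helper names `gpu_for_local_rank` and `processes_per_gpu` are hypothetical and are not taken from the NVSHMEM sample. It only illustrates the arithmetic: 5 processes leave one GPU shared by two processes, while 8 processes put exactly two on every GPU.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (an assumption, not part of the NVSHMEM sample):
// round-robin mapping from a node-local rank to a GPU index.
int gpu_for_local_rank(int local_rank, int num_gpus) {
    assert(num_gpus > 0 && local_rank >= 0);
    return local_rank % num_gpus;
}

// Count how many processes end up on each GPU for a given process count,
// under the round-robin assumption above.
std::vector<int> processes_per_gpu(int num_procs, int num_gpus) {
    std::vector<int> counts(num_gpus, 0);
    for (int rank = 0; rank < num_procs; ++rank) {
        ++counts[gpu_for_local_rank(rank, num_gpus)];
    }
    return counts;
}
```

Calling `processes_per_gpu(5, 4)` gives an uneven layout (one GPU hosts two processes, the others one each), whereas `processes_per_gpu(8, 4)` is even. If the real launch uses a different mapping, the counts per GPU may differ.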