NVSHMEM without MPI, one thread per GPU on a node: how to initialize?

Hi,

I want to run a calculation across multiple nodes, where each node has (say) 4 GPUs. I want to connect the nodes with MPI, and within each node its 4 GPUs should communicate through NVSHMEM, for which I intend to start 4 threads of the same process. Is there a way to initialize NVSHMEM with this setup?

(Having 4*#nodes MPI processes and defining MPI sub-communicators for the 4 processes within each node would be a suboptimal solution.)

Thanks

NVSHMEM does not support a thread-based model. PEs (processing elements) have to be separate processes, so each GPU needs its own process rather than its own thread.
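
For reference, here is a minimal sketch of the process-per-GPU layout, i.e. the sub-communicator variant mentioned in the question. It assumes the job is launched with exactly one MPI rank per GPU (4 per node in your case), that NVSHMEM was built with MPI bootstrap support, and that bootstrapping from a node-local sub-communicator is acceptable in your NVSHMEM release, which I have not verified for every version. The name `node_comm` is only illustrative, and error checking is omitted.

```cuda
// Sketch only: one process per GPU, NVSHMEM bootstrapped from a
// node-local MPI communicator. Error checking is omitted for brevity.
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Group the ranks that share a node (shared-memory domain).
    MPI_Comm node_comm;  // illustrative name
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                        world_rank, MPI_INFO_NULL, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    // Assumption: ranks per node == GPUs per node, so the node-local
    // rank can be used directly as the CUDA device index.
    cudaSetDevice(local_rank);

    // Hand the communicator to NVSHMEM. Newer releases may expect attr
    // to be set up with an initializer first; check the docs for your version.
    nvshmemx_init_attr_t attr;
    attr.mpi_comm = &node_comm;
    nvshmemx_init_attr(NVSHMEMX_INIT_WITH_MPI_COMM, &attr);

    printf("MPI rank %d -> NVSHMEM PE %d of %d\n",
           world_rank, nvshmem_my_pe(), nvshmem_n_pes());

    // ... nvshmem_malloc, kernels, and inter-node MPI traffic go here ...

    nvshmem_finalize();
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

Inter-node traffic can still go through MPI_COMM_WORLD (or a communicator holding one rank per node), while the 4 PEs on each node talk to each other through NVSHMEM. If you do not need to restrict NVSHMEM to a single node, passing MPI_COMM_WORLD in `attr.mpi_comm` instead is the simpler variant.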