What NVSHMEM configuration settings achieve optimal performance on a multi-node HPC cluster?

NVSHMEM offers so many optional configuration choices that it is hard for HPC practitioners to work out the optimal settings for their cluster's hardware and software stack.
In my case, each node of the cluster has 8 GPUs connected with NVLink, and the nodes are connected over InfiniBand. The nvidia-peermem kernel module is also installed. I would greatly appreciate a recommendation for the best settings to achieve optimal NVSHMEM performance, especially for the multi-node case.

For HPC clusters, libfabric is a very good choice for NVSHMEM's remote (inter-node) transport.
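
To make that concrete, here is a minimal sketch of selecting the libfabric transport. The environment variable names (NVSHMEM_REMOTE_TRANSPORT, NVSHMEM_LIBFABRIC_PROVIDER) come from the NVSHMEM environment-variable documentation, but the exact provider value depends on your fabric and NVSHMEM version; in practice these variables are usually exported in the job script rather than set in code.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    /* Usually exported in the job script; set here via setenv() only so the
       whole configuration is visible in one place. They must be set before
       nvshmem_init(), which is when NVSHMEM reads its environment. */
    setenv("NVSHMEM_REMOTE_TRANSPORT", "libfabric", 1);
    /* Provider depends on the fabric: "cxi" on HPE Slingshot 11,
       "verbs" on plain InfiniBand, "efa" on AWS EFA. */
    setenv("NVSHMEM_LIBFABRIC_PROVIDER", "verbs", 1);

    nvshmem_init();

    /* Bind each PE to one of the GPUs on its node. */
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    /* ... application ... */

    nvshmem_finalize();
    return 0;
}
```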

However, NVIDIA has introduced several newer software features. For example, I want to use GPUDirect RDMA and gdrcopy to further accelerate cross-node NVSHMEM communication, and simply switching to libfabric does not seem to work.

  1. Libfabric with a vendor-specific provider, such as CXI on HPE/Cray supercomputers, is usually more performant, because those implementations (often proprietary, though they need not be) expose hardware-specific optimizations for the NIC. And yes, libfabric over Slingshot 11 does use GPUDirect RDMA; otherwise it would not satisfy NVSHMEM's hardware requirements. Note that libfabric is only a software abstraction: GPUDirect RDMA also requires compatible hardware, which Slingshot 11 provides. You would need to check whether your own hardware meets this criterion.

  2. gdrcopy is only needed for the IBRC and UCX transports. IBGDA does not require gdrcopy (I believe) and still outperforms IBRC, so you need not worry about it; see the sketch after this list for an IBGDA setup on an InfiniBand cluster like yours.
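
Since your cluster already has nvidia-peermem loaded and InfiniBand between nodes, the IBGDA route is worth trying. The sketch below is hedged: NVSHMEM_IB_ENABLE_IBGDA is the documented switch for the GPU-initiated InfiniBand transport, but the driver-side requirement noted in the comment (the PeerMappingOverride registry key) depends on your driver version, so verify it against the IBGDA section of the NVSHMEM installation guide.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    /* Again, these would normally live in the job script. */

    /* Enable the GPU-initiated InfiniBand transport (IBGDA). It relies on
       GPUDirect RDMA and, unlike IBRC/UCX, does not depend on gdrcopy. */
    setenv("NVSHMEM_IB_ENABLE_IBGDA", "1", 1);

    /* Print initialization details so you can confirm which transport was
       actually selected (output format varies by NVSHMEM version). */
    setenv("NVSHMEM_DEBUG", "INFO", 1);

    /* Driver-side prerequisite (cannot be set from application code): the
       NVSHMEM docs list an nvidia.ko registry setting for IBGDA, e.g.
       NVreg_RegistryDwords="PeerMappingOverride=1;" at module load time.
       Check whether your driver version still requires it. */

    nvshmem_init();

    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);  /* one PE per GPU, 8 GPUs per node */

    /* ... application: intra-node traffic goes over NVLink (P2P),
       inter-node traffic over InfiniBand via IBGDA ... */

    nvshmem_finalize();
    return 0;
}
```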

As long as you have compatible inter-node hardware, preferably one for which the vendor supplies a libfabric provider implementation, libfabric is the more favorable choice for inter-node communication.
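
Whichever transport you settle on, a small ring-shift test (this sketch follows the standard pattern from the NVSHMEM examples) launched across two or more nodes is the quickest way to confirm that inter-node puts actually work with your configuration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

/* Each PE writes its rank into the symmetric buffer of the next PE. */
__global__ void ring_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    nvshmem_int_p(destination, mype, (mype + 1) % npes);
}

int main(void) {
    nvshmem_init();

    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Symmetric allocation: same size on every PE. */
    int *destination = (int *) nvshmem_malloc(sizeof(int));

    ring_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);

    int msg = -1;
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* With N PEs, PE i should report (i + N - 1) % N. */
    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```

Compile it with nvcc -rdc=true against the NVSHMEM headers and libraries, and launch one PE per GPU across your nodes with your usual launcher (for example srun or mpirun); if each PE prints its predecessor's rank, inter-node communication is working.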