What NVSHMEM configuration settings achieve optimal performance on a multi-node HPC cluster?

NVSHMEM offers so many optional configuration choices that it is hard for HPC practitioners to work out the optimal settings for their cluster's hardware and software stack.
In my case, each node of the cluster has 8 GPUs connected with NVLink, and the nodes are connected over InfiniBand. The nvidia-peermem kernel module is also installed. I would greatly appreciate a recommendation for the best settings to achieve optimal NVSHMEM performance, especially for the multi-node case.

For HPC clusters, libfabric is a very good choice for NVSHMEM's remote (inter-node) transport.
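
To make that concrete, here is a minimal sketch of selecting the libfabric transport. The environment variable names (NVSHMEM_REMOTE_TRANSPORT, NVSHMEM_LIBFABRIC_PROVIDER) come from the NVSHMEM environment-variable documentation, but the exact provider value depends on your fabric and NVSHMEM version; in practice these variables are usually exported in the job script rather than set in code.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    /* Usually exported in the job script; set here via setenv() only so the
       whole configuration is visible in one place. They must be set before
       nvshmem_init(), which is when NVSHMEM reads its environment. */
    setenv("NVSHMEM_REMOTE_TRANSPORT", "libfabric", 1);
    /* Provider depends on the fabric: "cxi" on HPE Slingshot 11,
       "verbs" on plain InfiniBand, "efa" on AWS EFA. */
    setenv("NVSHMEM_LIBFABRIC_PROVIDER", "verbs", 1);

    nvshmem_init();

    /* Bind each PE to one of the GPUs on its node. */
    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    /* ... application ... */

    nvshmem_finalize();
    return 0;
}
```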

However, NVIDIA has introduced several newer software features. For example, I want to use GPUDirect RDMA and gdrcopy to further accelerate cross-node NVSHMEM communication, and simply switching to libfabric does not seem to work.

  1. Libfabric with a vendor-specific provider, such as CXI on HPE/Cray supercomputers, is usually more performant, because those implementations (often proprietary, though they need not be) expose hardware-specific optimizations for the NIC. And yes, libfabric over Slingshot 11 does use GPUDirect RDMA; otherwise it would not satisfy NVSHMEM's hardware requirements. Note that libfabric is only a software abstraction: GPUDirect RDMA also requires compatible hardware, which Slingshot 11 provides. You would need to check whether your own hardware meets this criterion.

  2. gdrcopy is only needed for the IBRC and UCX transports. IBGDA does not require gdrcopy (I believe) and still outperforms IBRC, so you need not worry about it; see the sketch after this list for an IBGDA setup on an InfiniBand cluster like yours.
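
Since your cluster already has nvidia-peermem loaded and InfiniBand between nodes, the IBGDA route is worth trying. The sketch below is hedged: NVSHMEM_IB_ENABLE_IBGDA is the documented switch for the GPU-initiated InfiniBand transport, but the driver-side requirement noted in the comment (the PeerMappingOverride registry key) depends on your driver version, so verify it against the IBGDA section of the NVSHMEM installation guide.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

int main(void) {
    /* Again, these would normally live in the job script. */

    /* Enable the GPU-initiated InfiniBand transport (IBGDA). It relies on
       GPUDirect RDMA and, unlike IBRC/UCX, does not depend on gdrcopy. */
    setenv("NVSHMEM_IB_ENABLE_IBGDA", "1", 1);

    /* Print initialization details so you can confirm which transport was
       actually selected (output format varies by NVSHMEM version). */
    setenv("NVSHMEM_DEBUG", "INFO", 1);

    /* Driver-side prerequisite (cannot be set from application code): the
       NVSHMEM docs list an nvidia.ko registry setting for IBGDA, e.g.
       NVreg_RegistryDwords="PeerMappingOverride=1;" at module load time.
       Check whether your driver version still requires it. */

    nvshmem_init();

    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);  /* one PE per GPU, 8 GPUs per node */

    /* ... application: intra-node traffic goes over NVLink (P2P),
       inter-node traffic over InfiniBand via IBGDA ... */

    nvshmem_finalize();
    return 0;
}
```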

As long as you have compatible inter-node hardware, preferably one for which the vendor supplies a libfabric provider implementation, libfabric is the more favorable choice for inter-node communication.
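
Whichever transport you settle on, a small ring-shift test (this sketch follows the standard pattern from the NVSHMEM examples) launched across two or more nodes is the quickest way to confirm that inter-node puts actually work with your configuration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

/* Each PE writes its rank into the symmetric buffer of the next PE. */
__global__ void ring_shift(int *destination) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    nvshmem_int_p(destination, mype, (mype + 1) % npes);
}

int main(void) {
    nvshmem_init();

    int mype_node = nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE);
    cudaSetDevice(mype_node);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Symmetric allocation: same size on every PE. */
    int *destination = (int *) nvshmem_malloc(sizeof(int));

    ring_shift<<<1, 1, 0, stream>>>(destination);
    nvshmemx_barrier_all_on_stream(stream);

    int msg = -1;
    cudaMemcpyAsync(&msg, destination, sizeof(int), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* With N PEs, PE i should report (i + N - 1) % N. */
    printf("PE %d received %d\n", nvshmem_my_pe(), msg);

    nvshmem_free(destination);
    nvshmem_finalize();
    return 0;
}
```

Compile it with nvcc -rdc=true against the NVSHMEM headers and libraries, and launch one PE per GPU across your nodes with your usual launcher (for example srun or mpirun); if each PE prints its predecessor's rank, inter-node communication is working.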