NVSHMEM Performance Issue on A100 Cluster

Hi everyone,

I am running the Jacobi solver with NVSHMEM that is provided in NVIDIA's GitHub repository.
I am getting very poor performance on an A100 cluster. This seems unusual, as I have previously run the same code on another A100 cluster, where the efficiency was over 90% with 2-4 GPUs.

Here are the inter-node performance results for a 16384x16384 grid (single-GPU baseline vs. N GPUs):

2 GPUs: baseline 3.8456 s, multi-GPU 3.8965 s, speedup 0.99, efficiency 49.35%
3 GPUs: baseline 3.8329 s, multi-GPU 5.5050 s, speedup 0.70, efficiency 23.21%
4 GPUs: baseline 3.8377 s, multi-GPU 7.5464 s, speedup 0.51, efficiency 12.71%
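
(For context, the speedup and efficiency figures above are consistent with the usual strong-scaling definitions; writing them out since the efficiency column is a percentage:)

```latex
\text{speedup} = \frac{t_{1\,\text{GPU}}}{t_{N\,\text{GPU}}},
\qquad
\text{efficiency} = \frac{\text{speedup}}{N} \times 100\%
% e.g. 2 GPUs: 3.8456 / 3.8965 \approx 0.987, and 0.987 / 2 \approx 49.35\%
```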

I am wondering if I might be using an outdated version of NVSHMEM, or if there could be other potential reasons causing this performance degradation.
Here is the NVSHMEM configuration:
CUDA API 12000
CUDA Runtime 12030
CUDA Driver 12010
Build Timestamp Nov 29 2022 10:33:25

I would appreciate any insights or suggestions regarding this issue.

@Youyi1997 Hey, not affiliated with NVIDIA, but fairly familiar with NVSHMEM.

Please try to do the following to provide more information:

  1. It seems you are using a build from 2022. This may not be the root cause, but try the more recent 3.0.6 release and make sure to follow the build suggestions from this GTC’24 [S61339] talk.
  2. Run with NVSHMEM_DEBUG=TRACE and NVSHMEM_DEBUG_SUBSYS=ALL and share the resulting logs (see the example command after this list).
  3. Try profiling your application with Nsight Systems, with NVSHMEM_NVTX=ALL set, and share the traces too (a sketch follows below).
  4. Describe the A100 cluster: what is the inter-node network and its bandwidth?
  5. Describe the network transport: IBGDA? libfabric? etc.
  6. The Jacobi application uses MPI; which MPI implementation are you using?
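
For item 2, a minimal sketch of what the debug run could look like; the mpirun invocation and the ./jacobi path are placeholders for your actual launch line:

```
# Placeholder launch line: substitute your real mpirun flags and binary path.
# NVSHMEM_DEBUG=TRACE enables verbose logging; NVSHMEM_DEBUG_SUBSYS=ALL covers all subsystems.
NVSHMEM_DEBUG=TRACE NVSHMEM_DEBUG_SUBSYS=ALL \
  mpirun -np 2 ./jacobi 2>&1 | tee nvshmem_debug.log
```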
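And for item 3, one way to capture one Nsight Systems trace per rank with NVSHMEM's NVTX ranges enabled. This sketch assumes Open MPI (the -x flag and OMPI_COMM_WORLD_RANK are Open MPI specifics; adjust for your MPI implementation):

```
# Sketch assuming Open MPI: -x exports the env var to all ranks, and
# %q{OMPI_COMM_WORLD_RANK} makes nsys write one report per rank.
mpirun -np 2 -x NVSHMEM_NVTX=ALL \
  nsys profile --trace=cuda,nvtx,mpi -o jacobi_rank%q{OMPI_COMM_WORLD_RANK} ./jacobi
```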