NVSHMEM Performance Issue on A100 Cluster

Hi everyone,

I am running the Jacobi solver with NVSHMEM that is provided in NVIDIA's GitHub repository.
I am getting very poor performance on an A100 cluster. This seems unusual, as I have previously run the same code on another A100 cluster, where the efficiency was over 90% with 2-4 GPUs.

Here are the inter-node performance results for a 16384x16384 grid (single-GPU baseline vs. N GPUs):

2 GPUs: baseline 3.8456 s, multi-GPU 3.8965 s, speedup 0.99, efficiency 49.35%
3 GPUs: baseline 3.8329 s, multi-GPU 5.5050 s, speedup 0.70, efficiency 23.21%
4 GPUs: baseline 3.8377 s, multi-GPU 7.5464 s, speedup 0.51, efficiency 12.71%
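
(For context, the speedup and efficiency figures above are consistent with the usual strong-scaling definitions; writing them out since the efficiency column is a percentage:)

```latex
\text{speedup} = \frac{t_{1\,\text{GPU}}}{t_{N\,\text{GPU}}},
\qquad
\text{efficiency} = \frac{\text{speedup}}{N} \times 100\%
% e.g. 2 GPUs: 3.8456 / 3.8965 \approx 0.987, and 0.987 / 2 \approx 49.35\%
```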

I am wondering if I might be using an outdated version of NVSHMEM, or if there could be other potential reasons causing this performance degradation.
Here is the NVSHMEM configuration:
CUDA API 12000
CUDA Runtime 12030
CUDA Driver 12010
Build Timestamp Nov 29 2022 10:33:25

I would appreciate any insights or suggestions regarding this issue.

@Youyi1997 Hey, not affiliated with NVIDIA, but fairly familiar with NVSHMEM.

Please try to do the following to provide more information:

  1. It seems you are using a build from 2022. This may not be the root cause, but try the more recent 3.0.6 release and make sure to follow the build suggestions from this GTC’24 [S61339] talk.
  2. Run with NVSHMEM_DEBUG=TRACE and NVSHMEM_DEBUG_SUBSYS=ALL and share the resulting logs (see the example command after this list).
  3. Try profiling your application with Nsight Systems, with NVSHMEM_NVTX=ALL set, and share the traces too (a sketch follows below).
  4. Describe the A100 cluster: what is the inter-node network and its bandwidth?
  5. Describe the network transport: IBGDA? libfabric? etc.
  6. The Jacobi application uses MPI; which MPI implementation are you using?
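
For item 2, a minimal sketch of what the debug run could look like; the mpirun invocation and the ./jacobi path are placeholders for your actual launch line:

```
# Placeholder launch line: substitute your real mpirun flags and binary path.
# NVSHMEM_DEBUG=TRACE enables verbose logging; NVSHMEM_DEBUG_SUBSYS=ALL covers all subsystems.
NVSHMEM_DEBUG=TRACE NVSHMEM_DEBUG_SUBSYS=ALL \
  mpirun -np 2 ./jacobi 2>&1 | tee nvshmem_debug.log
```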
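And for item 3, one way to capture one Nsight Systems trace per rank with NVSHMEM's NVTX ranges enabled. This sketch assumes Open MPI (the -x flag and OMPI_COMM_WORLD_RANK are Open MPI specifics; adjust for your MPI implementation):

```
# Sketch assuming Open MPI: -x exports the env var to all ranks, and
# %q{OMPI_COMM_WORLD_RANK} makes nsys write one report per rank.
mpirun -np 2 -x NVSHMEM_NVTX=ALL \
  nsys profile --trace=cuda,nvtx,mpi -o jacobi_rank%q{OMPI_COMM_WORLD_RANK} ./jacobi
```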