Hi everyone,
I am running the NVSHMEM Jacobi solver provided on NVIDIA's GitHub.
I am getting very poor performance on an A100 cluster. This seems unusual, as I previously ran the same code on another A100 cluster, where the efficiency was over 90% with 2 to 4 GPUs.
Here are the performance results (internode):
Num GPUs: 2.
16384x16384: 1 GPU: 3.8456 s, 2 GPUs: 3.8965 s, speedup: 0.99, efficiency: 49.35
Num GPUs: 3.
16384x16384: 1 GPU: 3.8329 s, 3 GPUs: 5.5050 s, speedup: 0.70, efficiency: 23.21
Num GPUs: 4.
16384x16384: 1 GPU: 3.8377 s, 4 GPUs: 7.5464 s, speedup: 0.51, efficiency: 12.71
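For anyone checking the numbers, here is a small sketch of how the speedup and efficiency columns relate to the timings. It assumes the usual definitions (speedup = single-GPU time / multi-GPU time, efficiency = speedup / number of GPUs, as a percentage); the function name `scaling` is only for illustration and is not part of the solver.

```python
def scaling(t_single, t_multi, num_gpus):
    """Return (speedup, efficiency in percent) for a strong-scaling run.

    Assumed definitions: speedup = t_single / t_multi,
    efficiency = speedup / num_gpus * 100.
    """
    speedup = t_single / t_multi
    efficiency = speedup / num_gpus * 100.0
    return speedup, efficiency


# Timings (in seconds) copied from the results above: (GPUs, 1-GPU time, N-GPU time)
runs = [
    (2, 3.8456, 3.8965),
    (3, 3.8329, 5.5050),
    (4, 3.8377, 7.5464),
]

for num_gpus, t_single, t_multi in runs:
    speedup, efficiency = scaling(t_single, t_multi, num_gpus)
    print(f"{num_gpus} GPUs: speedup {speedup:.2f}, efficiency {efficiency:.2f}%")
```

This reproduces the reported values, so the efficiency column is a percentage. The multi-GPU runtime grows with each added GPU instead of shrinking.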
I am wondering whether I am using an outdated version of NVSHMEM, or whether something else could be causing this performance degradation.
Here is the NVSHMEM configuration:
CUDA API 12000
CUDA Runtime 12030
CUDA Driver 12010
Build Timestamp Nov 29 2022 10:33:25
I would appreciate any insights or suggestions regarding this issue.