I am reading the NVIDIA NVSHMEM documentation and would like to follow it to measure NVSHMEM performance on A100.
I am currently using the latest release, 2.9.0-2, which includes the perftest natively. I have completed the build and run the perftest successfully on 8xA100 within a single node. To my surprise, both the collective and P2P tests perform poorly: the measured bandwidth is far below NVLink bandwidth. From the NVIDIA blogs and docs I have read, it sounds like the NVSHMEM perftest should get close to NVLink bandwidth when NVSHMEM runs over NVLink, but they don't mention how to run the tests to achieve that.
So I would like to ask whether someone can help me with performance testing on NVSHMEM.