NVSHMEM Performance Test on A100

Hi all,
I have been reading the NVIDIA NVSHMEM documentation and would like to measure NVSHMEM performance on A100.
I am using the latest release, 2.9.0-2, which includes the perftest suite. I have built it and run the perftests successfully on 8x A100 within a single node. To my surprise, both the collective and the P2P tests perform very poorly: the measured bandwidth is far below NVLink bandwidth. According to some NVIDIA blogs and docs I have read, the NVSHMEM perftests should get close to NVLink bandwidth when NVSHMEM runs over NVLink, but they do not explain how to run the tests to achieve that.
So I would like to ask if someone can help me with NVSHMEM performance testing.

Many thanks

Hello. If you are running the shmem_put_bw perftest, note that by default it runs only 4 CTAs. To achieve full NVLink bandwidth, try increasing the number of CTAs by passing the “-c #CTAs” argument to the program. Please let us know how it goes.
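
To illustrate why the CTA count matters, here is a minimal sketch (not the perftest source) in which each CTA pushes its own slice of a buffer to a neighbouring PE, so more CTAs means more of the GPU is issuing stores over NVLink at once. The kernel name put_slices, the 64 MiB buffer, the 20 iterations, the 1024 threads per CTA and the positional CTA-count argument are all assumptions made for this example.

```cuda
// Sketch only: splits one put across many CTAs. Not the shmem_put_bw source.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each CTA copies one contiguous slice of the buffer to the peer PE.
__global__ void put_slices(char *dst, const char *src, size_t bytes, int peer) {
    size_t slice = (bytes + gridDim.x - 1) / gridDim.x;  // bytes per CTA, rounded up
    size_t off = (size_t)blockIdx.x * slice;
    if (off < bytes) {  // condition is uniform within a CTA
        size_t n = (bytes - off < slice) ? (bytes - off) : slice;
        // Block-scoped put: all threads of the CTA cooperate on the copy.
        nvshmemx_putmem_block(dst + off, src + off, n, peer);
    }
}

int main(int argc, char **argv) {
    // Hypothetical positional argument: number of CTAs (the perftest uses -c).
    int num_ctas = (argc > 1) ? atoi(argv[1]) : 32;
    if (num_ctas < 1) num_ctas = 1;
    const size_t bytes = (size_t)64 << 20;  // assumed message size: 64 MiB
    const int iters = 20;                   // assumed iteration count

    nvshmem_init();
    // One GPU per PE on the node.
    cudaSetDevice(nvshmem_team_my_pe(NVSHMEMX_TEAM_NODE));
    int mype = nvshmem_my_pe();
    int peer = (mype + 1) % nvshmem_n_pes();

    // Symmetric allocations: the same buffers exist on every PE.
    char *src = (char *)nvshmem_malloc(bytes);
    char *dst = (char *)nvshmem_malloc(bytes);
    cudaMemset(src, 1, bytes);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    nvshmem_barrier_all();
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i) {
        put_slices<<<num_ctas, 1024>>>(dst, src, bytes, peer);
    }
    nvshmemx_quiet_on_stream(0);  // wait for the puts to complete
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // bytes * iters / seconds, reported in GB/s (1e9 bytes per second).
    double gbps = (double)bytes * iters / (ms * 1.0e6);
    printf("PE %d -> PE %d, %d CTAs: %.2f GB/s\n", mype, peer, num_ctas, gbps);

    nvshmem_barrier_all();
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```

Running this with one PE per GPU and a few different CTA counts should show the same trend as the perftest, although the absolute numbers will depend on the system and are not expected to match shmem_put_bw exactly.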

Many thanks for the reply!
Currently I am comparing collective performance, for example with the alltoall latency test. I have tried several CTA settings, e.g. 64. That improves performance from 2 GB/s to 4 GB/s, but I think it is still far from NCCL and NVLink bandwidth.

Any suggestions on this point?

Cheers