Is there any detailed report or data on DSMEM (Distributed Shared Memory) bandwidth on NVIDIA H100 GPUs? Additionally, are there any specific examples or case studies demonstrating its usage?
Specifically, I am interested in GEMM.
Probably the best source for bandwidth measurements is
Luo et al., "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture":
https://arxiv.org/pdf/2402.13499v1
Thanks! That paper does not cover GEMM, but I learned from it that DSM bandwidth could be around 3 TB/s. Interesting!
You could also use cuBLAS functions and look at their performance with Nsight Compute.
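On the usage side of the original question, here is a minimal sketch of how distributed shared memory is reached from a kernel, using the thread block cluster API in cooperative groups. It is not a GEMM and not a bandwidth benchmark: each block in a two-block cluster fills its own shared buffer and then reads its neighbor's buffer through `cluster.map_shared_rank`. The kernel name `dsmem_exchange` and the constants `kClusterSize` and `kThreads` are made up for illustration. It assumes a compute capability 9.0 device and a build along the lines of `nvcc -arch=sm_90`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int kClusterSize = 2;    // illustrative: blocks per cluster
constexpr int kThreads     = 128;  // illustrative: threads per block

// Each block fills its own shared buffer, then reads the buffer of the
// next block in the cluster through distributed shared memory.
__global__ void __cluster_dims__(kClusterSize, 1, 1)
dsmem_exchange(int *out)
{
    __shared__ int smem[kThreads];

    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();

    smem[threadIdx.x] = static_cast<int>(rank * 1000 + threadIdx.x);

    // All blocks of the cluster must have written before any remote read.
    cluster.sync();

    // Translate the local shared address into the peer block's address.
    unsigned int peer = (rank + 1) % cluster.num_blocks();
    int *peer_smem = cluster.map_shared_rank(smem, peer);
    out[blockIdx.x * blockDim.x + threadIdx.x] = peer_smem[threadIdx.x];

    // Keep every block resident until all remote reads have finished.
    cluster.sync();
}

int main()
{
    constexpr int n = kClusterSize * kThreads;
    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    // One cluster: the grid size must be a multiple of the cluster size.
    dsmem_exchange<<<kClusterSize, kThreads>>>(d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_out[n];
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Block b should hold the values written by block (b + 1) % kClusterSize.
    int errors = 0;
    for (int b = 0; b < kClusterSize; ++b)
        for (int t = 0; t < kThreads; ++t)
            if (h_out[b * kThreads + t] != ((b + 1) % kClusterSize) * 1000 + t)
                ++errors;

    printf("mismatches: %d\n", errors);
    return errors == 0 ? 0 : 1;
}
```

For a GEMM, the same mechanism could in principle let blocks in a cluster share operand tiles instead of each block loading its own copy from global memory. Whether that helps would have to be measured, for example with Nsight Compute as suggested above.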