Is there any report on DSMEM bandwidth on H100 or specific usage examples?

Is there any detailed report or data on DSMEM (Distributed Shared Memory) bandwidth on NVIDIA H100 GPUs? Additionally, are there any specific examples or case studies demonstrating its usage?

Specifically, I am interested in GEMM.

Perhaps the best source for bandwidth measurements would be

Luo et al.: Benchmarking and Dissecting the Nvidia Hopper GPU Architecture:
https://arxiv.org/pdf/2402.13499v1
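
On the usage side of the question, here is a minimal sketch of how distributed shared memory is accessed through the cooperative groups cluster API. It is not taken from the paper and it is not a bandwidth benchmark or a GEMM; the kernel name `ring_exchange`, the helper `ring_exchange_mismatches`, the cluster size of 2, and the block size are all assumptions chosen for illustration. Each block fills its own shared memory, then reads the shared memory of its neighbor in the cluster.

```cuda
// Build (assumed file name): nvcc -arch=sm_90 dsmem_ring.cu -o dsmem_ring
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

constexpr int kThreads = 128;  // threads per block (illustrative choice)

// Compile-time cluster of 2 blocks along x; the grid size must be a multiple of 2.
__global__ void __cluster_dims__(2, 1, 1) ring_exchange(int *out)
{
    __shared__ int smem[kThreads];

    cg::cluster_group cluster = cg::this_cluster();
    unsigned int rank = cluster.block_rank();
    unsigned int peer = (rank + 1) % cluster.num_blocks();

    // Each block writes a recognizable pattern into its own shared memory.
    smem[threadIdx.x] = blockIdx.x * kThreads + threadIdx.x;

    // Make sure every block in the cluster has finished writing.
    cluster.sync();

    // Map our smem address to the corresponding address in the peer block.
    int *remote = cluster.map_shared_rank(smem, peer);
    int value = remote[threadIdx.x];  // this load goes through DSMEM

    // No block may exit while a peer might still be reading its shared memory.
    cluster.sync();

    out[blockIdx.x * kThreads + threadIdx.x] = value;
}

// Hypothetical host helper: returns the number of wrong elements, or -1 on a CUDA error.
int ring_exchange_mismatches()
{
    const int blocks = 4;
    const int n = blocks * kThreads;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    ring_exchange<<<blocks, kThreads>>>(d_out);
    cudaError_t err = cudaGetLastError();

    std::vector<int> h_out(n);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < kThreads; ++t)
            // With a 1D cluster of size 2, the peer of block b is block b ^ 1.
            if (h_out[b * kThreads + t] != (b ^ 1) * kThreads + t) ++bad;
    return bad;
}

int main()
{
    int bad = ring_exchange_mismatches();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

This needs a GPU with compute capability 9.0 and a toolkit that supports thread block clusters. To turn it into a bandwidth test you would have to loop over the remote reads and time the kernel, which this sketch does not do.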

Well, thanks! It is just not about GEMM. But I learned that DSM bandwidth could reach 3 TB/s. Interesting!

You could also call cuBLAS GEMM functions and examine their performance with Nsight Compute.
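
A minimal sketch of what that could look like: a small program that runs one `cublasSgemm` call, which can then be launched under the Nsight Compute command-line tool `ncu`. The matrix size, the helper name `gemm_max_error`, and the file name are assumptions for illustration. Whether the kernel cuBLAS selects actually uses thread block clusters or DSMEM is something to check in the profiler output, not something this sketch guarantees.

```cuda
// Build (assumed file name): nvcc gemm_probe.cu -lcublas -o gemm_probe
// Profile:                   ncu ./gemm_probe
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical helper: computes C = A * B for n x n matrices of ones and
// returns the largest deviation from the expected value n, or -1 on error.
float gemm_max_error(int n)
{
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);
    std::vector<float> hA(count, 1.0f), hB(count, 1.0f), hC(count, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return -1.0f; }
    if (cudaMalloc(&dC, bytes) != cudaSuccess) { cudaFree(dA); cudaFree(dB); return -1.0f; }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasStatus_t status = cublasCreate(&handle);
    if (status == CUBLAS_STATUS_SUCCESS) {
        const float alpha = 1.0f, beta = 0.0f;
        // Column-major GEMM: C = alpha * A * B + beta * C, all n x n.
        status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                             &alpha, dA, n, dB, n, &beta, dC, n);
        cublasDestroy(handle);
    }

    // The blocking copy also waits for the GEMM kernel to finish.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    if (status != CUBLAS_STATUS_SUCCESS || err != cudaSuccess) return -1.0f;

    // Every entry of C is a sum of n ones, so it should equal n.
    float max_err = 0.0f;
    for (float c : hC) max_err = std::fmax(max_err, std::fabs(c - static_cast<float>(n)));
    return max_err;
}

int main()
{
    float err = gemm_max_error(1024);
    printf("max error: %f\n", err);
    return err == 0.0f ? 0 : 1;
}
```

For a realistic measurement you would use matrix sizes and data types that match your workload, since cuBLAS may pick different kernels for different shapes.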