For documentation about the recommended access patterns, see the tuning guide (here: NVIDIA Hopper Tuning Guide) - it says the same as what Robert already stated and linked to in the other post:
In order to achieve best performance for accesses to Distributed Shared Memory, access patterns similar to those described in the CUDA C++ Best Practices Guide for Global Memory should be used. Specifically, accesses to Distributed Shared Memory should be coalesced and aligned to 32-byte segments, if possible. Access patterns with non-unit stride should be avoided if possible, which can be achieved by using local shared memory, similar to what is shown in the CUDA C++ Best Practices Guide for Shared Memory.
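To make that concrete, here is a minimal sketch (my own, not from the guide) of a coalesced, unit-stride read of a neighbor block's shared memory through the cooperative groups cluster API; the kernel name, buffer size and values written are illustrative:

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Cluster of 2 blocks; each block reads its partner's shared memory with
// unit stride, so each warp touches contiguous, 32-byte-aligned segments.
__global__ void __cluster_dims__(2, 1, 1) dsmem_read(float *out)
{
    __shared__ float buf[1024];
    cg::cluster_group cluster = cg::this_cluster();

    // Fill this block's own shared memory.
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        buf[i] = blockIdx.x + i;

    cluster.sync();  // all shared buffers in the cluster are now populated

    // Map the partner block's shared memory into this block's address space.
    float *remote = cluster.map_shared_rank(buf, cluster.block_rank() ^ 1);

    // Coalesced, unit-stride access: consecutive threads read consecutive
    // 4-byte words, i.e. whole 32-byte segments per quarter-warp.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        acc += remote[i];

    cluster.sync();  // keep the remote buffer alive until all reads are done
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

Compiled for sm_90 and launched as e.g. `dsmem_read<<<num_blocks, 256>>>(d_out)` with `num_blocks` a multiple of the cluster size 2.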
You can find some 3rd-party information about the SM-SM interconnect properties in the paper "Benchmarking and Dissecting the Nvidia Hopper GPU Architecture": https://arxiv.org/pdf/2402.13499
The gist seems to be that the different SM-SM connections compete for bandwidth resources.
Also, Nvidia specifies that 8 thread blocks (i.e. SMs) per cluster is the portable setting across architectures, while Hopper specifically supports 16 as a non-portable, opt-in size. This seems to be a hint that adding more SMs to a cluster is not overly expensive, so it is probably not a quadratic interconnect effort (-> not each SM connected to each other with full bandwidth).
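For reference, clusters larger than 8 blocks have to be enabled explicitly per kernel; a rough sketch of that opt-in with the runtime API (kernel name and launch dimensions are placeholders, error checking omitted):

```
#include <cuda_runtime.h>

__global__ void kernel(float *data) { /* ... uses cluster features ... */ }

void launch_with_16_block_cluster(float *d_data)
{
    // Cluster sizes above 8 are not portable across architectures and must
    // be allowed explicitly for the kernel.
    cudaFuncSetAttribute((const void *)kernel,
                         cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(16 * 8);   // grid size must be a multiple of the cluster size
    cfg.blockDim = dim3(256);

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 16;    // 16 blocks (SMs) per cluster, H100-specific
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, kernel, d_data);
}
```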
I would assume (without any specific Hopper knowledge apart from the named sources) some kind of several parallel buses or a crossbar architecture with 2048 bytes/cycle, comprising 16 independent transfers at the same time (16 * 128 bytes). That would give 1755 MHz * 2048 bytes ≈ 3.6 TB/s, which roughly fits the 3.27 TB/s from the paper.
Those 3.27 TB/s were achieved with a cluster size of 2; however, one can assume (as the memory bandwidth of each SM is less) that they ran 8 clusters of size 2 and summed the results over the 8 clusters and over the 2 blocks (= SMs) per cluster. So probably each SM can transmit and/or receive only one 128-byte transfer per cycle at a time. The number 16 derived from the bandwidth fits well with the maximum of 16 SMs per cluster on the H100.
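As a back-of-the-envelope check (the 16 concurrent 128-byte transfers and the 1755 MHz clock are my assumptions from above, not numbers from the paper):

$$16 \times 128\ \text{B} = 2048\ \text{B/cycle}, \qquad 2048\ \text{B/cycle} \times 1.755\ \text{GHz} \approx 3.59\ \text{TB/s}$$

$$\frac{3.27\ \text{TB/s}}{16\ \text{SMs}} \approx 204\ \text{GB/s per SM} \approx 116\ \text{B/cycle per SM at 1.755 GHz}$$

So the measured aggregate is about 91% of that assumed peak, and per SM it comes out close to one 128-byte transfer per cycle, consistent with the interpretation above.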
One could test whether the granularity of transfers actually is 128 bytes or 32 bytes.
However, Fig. 9 could indicate an even higher throughput for the "histogram application" than for the throughput benchmark of Fig. 8. I presume that, as no unit like TB/s is stated in Fig. 9, it is a relative throughput and overall less than the 3.27 TB/s; the text also gives 3.27 TB/s as the peak throughput and no absolute numbers for Fig. 9.
Distributed Shared Memory. SM-to-SM network latency is 180 cycles, a 32% reduction compared to L2 cache. This validates the advantages of the network, facilitating efficient data exchange from producers to consumers.

In Fig. 8, SM-to-SM throughput is illustrated for varying cluster and block sizes. As typically observed in similar benchmarks, larger block sizes and more parallelizable instructions result in higher throughputs. A peak throughput of nearly 3.27 TB/s is observed with a cluster size of 2, reducing to 2.65 TB/s with a cluster size of 4. Interestingly, as more blocks in the cluster compete for SM-to-SM bandwidth, the overall throughput gets lower and lower. While a larger cluster size can reduce data movement latency for more blocks, it intensifies throughput competition. Balancing this tradeoff by selecting optimal block and cluster sizes is an important direction for exploration.
Fig. 9 displays the histogram throughput with distributed shared memory. First, the optimal cluster size differs for various block sizes (CS=4 for block size 128, CS=2 for block size 512). Increasing block and cluster sizes can saturate SM-to-SM network utilization, potentially degrading overall performance due to resource contention. Second, a notable performance drop occurs from 1024 to 2048 Nbins when CS=1. Larger Nbins demand more shared memory space and limit active block numbers on an SM. Employing the cluster mechanism to divide Nbins within the same cluster enhances block concurrency, mitigating this issue. Lastly, although shared memory is not a limiting factor for active block numbers with block size = 512, choosing an appropriate cluster size eases the on-chip shared memory traffic by leveraging the SM-to-SM network resource, ultimately improving overall performance.
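Regarding the "divide Nbins within the same cluster" point: here is a minimal sketch of that idea (names, sizes and the bin-to-block mapping are mine; the CUDA C++ Programming Guide's Distributed Shared Memory section has a fuller histogram example along these lines). It assumes non-negative input values, nbins divisible by the cluster size, and that one bin slice fits into each block's dynamic shared memory:

```
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each block of the cluster owns a contiguous slice of the histogram bins in
// its shared memory; updates to bins owned by another block go over the
// SM-to-SM network via map_shared_rank().
__global__ void cluster_histogram(const int *input, int n, int nbins,
                                  int bins_per_block, int *global_bins)
{
    extern __shared__ int local_bins[];          // bins_per_block ints
    cg::cluster_group cluster = cg::this_cluster();

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        local_bins[i] = 0;
    cluster.sync();                              // all slices are initialized

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        int bin        = input[i] % nbins;
        int dst_rank   = bin / bins_per_block;   // which block owns this bin
        int dst_offset = bin % bins_per_block;   // position inside its slice
        int *dst = cluster.map_shared_rank(local_bins, dst_rank);
        atomicAdd(dst + dst_offset, 1);          // local or remote smem atomic
    }
    cluster.sync();                              // all remote updates have landed

    // Fold this block's slice into the global histogram.
    int first_bin = cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        atomicAdd(&global_bins[first_bin + i], local_bins[i]);
}
```

It would be launched with cudaLaunchKernelEx, a cluster dimension of nbins / bins_per_block, and bins_per_block * sizeof(int) of dynamic shared memory per block; larger Nbins then only cost each block a slice of the bins instead of the full histogram, which is exactly the concurrency effect described for CS > 1 above.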