What is the inter-SM linkage of DSM (cluster)?

Hi! I am wondering how the SMs within a cluster are connected for distributed shared memory (DSM). For example, a thread on SM0 accessing SM0's own shared memory will be fastest, and a thread on SM0 accessing SM1's shared memory will be slower. But what about accessing SM2 or SM3? Is that the same?

The NoC might be a star, a ring, or some combination. Could any details be provided? Thanks!

I don’t believe there are published details to answer your questions. This thread may be of interest, perhaps.

One of the main takeaways from that thread is that when accessing non-local DSM, i.e. shared memory that belongs to another threadblock, the preferred access patterns are similar to the preferred access patterns for global memory, e.g. adjacent/contiguous.
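
For concreteness, here is a minimal sketch of what such a non-local DSM access looks like (my own illustration, assuming sm_90 and a recent CUDA toolkit; the kernel name, array size, and the 2-block cluster are made up for the example):

```cpp
// Minimal sketch: each block fills its own shared memory, then reads the
// peer block's shared memory through the cluster's map_shared_rank().
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int REMOTE_WORDS = 1024;   // illustrative size

__global__ void __cluster_dims__(2, 1, 1) dsm_read(int *out)
{
    __shared__ int smem[REMOTE_WORDS];
    cg::cluster_group cluster = cg::this_cluster();

    // Fill local shared memory with block-specific data.
    for (int i = threadIdx.x; i < REMOTE_WORDS; i += blockDim.x)
        smem[i] = cluster.block_rank() * REMOTE_WORDS + i;

    // Make every block's shared memory visible before remote reads.
    cluster.sync();

    // Map the shared-memory pointer of the other block in this 2-block cluster.
    unsigned int peer = cluster.block_rank() ^ 1;
    int *remote = cluster.map_shared_rank(smem, peer);

    // Coalesced, unit-stride reads over the SM-to-SM fabric.
    int acc = 0;
    for (int i = threadIdx.x; i < REMOTE_WORDS; i += blockDim.x)
        acc += remote[i];

    // Keep all blocks alive until every remote access has completed.
    cluster.sync();

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // one slot per thread
}
```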


For documentation of the recommended access patterns, see the tuning guide (NVIDIA Hopper Tuning Guide); it says the same thing Robert already stated and linked to from the other post:

In order to achieve best performance for accesses to Distributed Shared Memory, access patterns similar to those described in the CUDA C++ Best Practices Guide for Global Memory should be used. Specifically, accesses to Distributed Shared Memory should be coalesced and aligned to 32-byte segments, if possible. Access patterns with non-unit stride should be avoided if possible, which can be achieved by using local shared memory, similar to what is shown in the CUDA C++ Best Practices Guide for Shared Memory.
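
To make the last recommendation concrete, here is a hedged sketch (mine, not from the guide) of staging remote data into local shared memory with coalesced loads, so that any non-unit-stride work stays off the SM-to-SM fabric (TILE and the kernel name are illustrative):

```cpp
// Sketch: coalesced, unit-stride loads over the SM-to-SM fabric into a local
// staging buffer; strided accesses only touch the local copy afterwards.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 2048;   // illustrative tile size

__global__ void __cluster_dims__(2, 1, 1) stage_then_stride(float *out, int stride)
{
    __shared__ float remote_src[TILE];   // owned by this block, read by the peer
    __shared__ float local_copy[TILE];   // local staging buffer

    cg::cluster_group cluster = cg::this_cluster();

    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        remote_src[i] = static_cast<float>(i);
    cluster.sync();

    float *peer = cluster.map_shared_rank(remote_src, cluster.block_rank() ^ 1);

    // Coalesced: consecutive threads read consecutive remote addresses.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        local_copy[i] = peer[i];
    __syncthreads();

    // Non-unit-stride access happens on the local copy, not over the fabric.
    float acc = local_copy[(threadIdx.x * stride) % TILE];

    cluster.sync();                      // keep the peer's shared memory alive
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // one slot per thread
}
```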

You can find some 3rd-party information about the SM-SM interconnect properties in the paper “Benchmarking and Dissecting the Nvidia Hopper GPU Architecture”: https://arxiv.org/pdf/2402.13499

The gist seems to be that the different SM-SM connections compete for bandwidth resources.

Also, Nvidia specifies that 8 SMs per cluster is the portable setting compatible across architectures, while Hopper specifically supports up to 16 with an opt-in. This seems to be a hint that adding more SMs to a cluster is not overly expensive, so it is probably not a quadratic interconnect effort (i.e. not every SM connected to every other SM at full bandwidth).
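
For reference, cluster sizes above 8 require an explicit opt-in at launch time. A host-side sketch (kernel name, grid and block sizes are placeholders):

```cpp
// Hedged sketch of launching a kernel with a non-portable cluster size of 16.
#include <cuda_runtime.h>

__global__ void cluster_kernel(float *out) { /* ... uses DSM ... */ }

void launch_with_cluster_16(float *d_out, cudaStream_t stream)
{
    // Opt in to non-portable cluster sizes (anything above 8).
    cudaFuncSetAttribute(cluster_kernel,
                         cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(128, 1, 1);   // must be a multiple of the cluster size
    cfg.blockDim = dim3(256, 1, 1);
    cfg.stream   = stream;

    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeClusterDimension;
    attr[0].val.clusterDim.x = 16;    // 16 blocks -> up to 16 SMs per cluster
    attr[0].val.clusterDim.y = 1;
    attr[0].val.clusterDim.z = 1;
    cfg.attrs    = attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, cluster_kernel, d_out);
}
```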

I would assume (without any specific Hopper knowledge beyond the named sources) some kind of several-parallel-buses or crossbar architecture with 2048 bytes/cycle, comprising 16 independent transfers at the same time (16 * 128 bytes). That would lead to 1755 MHz * 2048 bytes, which roughly fits the 3.27 TB/s from the paper.
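
Written out (the 1755 MHz boost clock is my assumption; the measured 3.27 TB/s would correspond to an effective clock of roughly 1.6 GHz):

$$
16 \times 128\,\mathrm{B} = 2048\,\mathrm{B/cycle},\qquad
2048\,\mathrm{B/cycle} \times 1.755\,\mathrm{GHz} \approx 3.6\,\mathrm{TB/s}
$$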

Those 3.27 TB/s were achieved with a cluster size of 2; however, one can assume (since the memory bandwidth of each individual SM is lower) that they measured 8 clusters of size 2 and summed the results over the 8 clusters and the 2 blocks (= SMs) per cluster. So probably each SM can transmit and/or receive only one 128-byte transfer per cycle at a time. The number 16 derived from the bandwidth fits well with the maximum of 16 SMs per cluster on the H100.

One could test whether the granularity of transfers actually is 128 bytes or 32 bytes.
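
A rough sketch of how one might probe this (my own idea, not a validated benchmark; kernel name and sizes are illustrative): read the peer block's shared memory with a configurable element stride and time it with clock64(). If effective bandwidth stops dropping once the stride exceeds one 32-byte segment (8 floats), the granularity is likely 32 bytes; if it keeps dropping until 128 bytes (32 floats), it is likely 128 bytes.

```cpp
// Probe kernel: timed strided reads of the peer block's shared memory.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int WORDS = 4096;   // 16 KiB of shared memory per block
constexpr int ITERS = 256;

__global__ void __cluster_dims__(2, 1, 1)
granularity_probe(float *sink, long long *cycles, int stride)
{
    __shared__ float smem[WORDS];
    cg::cluster_group cluster = cg::this_cluster();

    for (int i = threadIdx.x; i < WORDS; i += blockDim.x)
        smem[i] = 1.0f;
    cluster.sync();

    float *peer = cluster.map_shared_rank(smem, cluster.block_rank() ^ 1);

    float acc = 0.f;
    long long t0 = clock64();
    for (int it = 0; it < ITERS; ++it) {
        // Larger strides touch fewer elements per hardware transfer segment,
        // wasting more of whatever the real granularity is.
        int idx = (threadIdx.x * stride + it) % WORDS;
        acc += peer[idx];
    }
    long long t1 = clock64();

    cluster.sync();                      // keep the peer's shared memory alive
    atomicAdd(&sink[blockIdx.x], acc);   // prevent dead-code elimination
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;
}
```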

However, Fig. 9 could indicate an even higher throughput for the ‘histogram application’ than the throughput benchmark of Fig. 8 achieves. I presume that, since no unit like TB/s is stated in Fig. 9, it is a relative throughput, and overall less than the 3.27 TB/s. Also, the text gives 3.27 TB/s as the peak throughput and no absolute numbers for Fig. 9.

Distributed Shared Memory. SM-to-SM network latency is 180 cycles, a 32% reduction compared to L2 cache. This validates the advantages of the network, facilitating efficient data exchange from producers to consumers.

In Fig. 8, SM-to-SM throughput is illustrated for varying cluster and block sizes. As typically observed in similar benchmarks, larger block sizes and more parallelizable instructions result in higher throughputs. A peak throughput of nearly 3.27 TB/s is observed with a cluster size of 2, reducing to 2.65 TB/s with a cluster size of 4. Interestingly, as more blocks in the cluster compete for SM-to-SM bandwidth, the overall throughput gets lower and lower. While a larger cluster size can reduce data movement latency for more blocks, it intensifies throughput competition. Balancing this tradeoff by selecting optimal block and cluster sizes is an important direction for exploration.

Fig. 9 displays the histogram throughput with distributed shared memory. First, the optimal cluster size differs for various block sizes (CS=4 for block size 128, CS=2 for block size 512). Increasing block and cluster sizes can saturate SM-to-SM network utilization, potentially degrading overall performance due to resource contention. Second, a notable performance drop occurs from 1024 to 2048 Nbins when CS=1. Larger Nbins demand more shared memory space and limit active block numbers on an SM. Employing the cluster mechanism to divide Nbins within the same cluster enhances block concurrency, mitigating this issue. Lastly, although shared memory is not a limiting factor for active block numbers with block size = 512, choosing an appropriate cluster size eases the on-chip shared memory traffic by leveraging the SM-to-SM network resource, ultimately improving overall performance.
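
For readers who want to see what “dividing Nbins within the same cluster” could look like in practice, here is an illustrative sketch (my own, not the paper's benchmark code): each block owns a contiguous slice of the bins in its shared memory, and other blocks update that slice over the SM-to-SM network with atomics.

```cpp
// Sketch: histogram bins partitioned across the blocks of a cluster (DSM).
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cluster_histogram(int *global_bins, int nbins,
                                  const int *input, int n)
{
    extern __shared__ int local_bins[];          // this block's slice of bins
    cg::cluster_group cluster = cg::this_cluster();
    int cluster_size   = cluster.dim_blocks().x;
    int bins_per_block = nbins / cluster_size;   // assume it divides evenly

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        local_bins[i] = 0;
    cluster.sync();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int bin   = input[i] % nbins;            // assumes non-negative inputs
        int owner = bin / bins_per_block;        // block that holds this bin
        int *dst  = cluster.map_shared_rank(local_bins, owner);
        atomicAdd(&dst[bin % bins_per_block], 1);
    }
    cluster.sync();                              // all remote updates finished

    // Each block flushes its own slice to the global histogram.
    int base = cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        atomicAdd(&global_bins[base + i], local_bins[i]);
}
```

This would be launched with cudaLaunchKernelEx (cluster dimension set to the chosen CS) and bins_per_block * sizeof(int) bytes of dynamic shared memory per block, so a larger cluster size directly reduces the shared-memory footprint per SM, which is the effect the paper describes for the 1024-to-2048 Nbins case.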


Actually, it’s very strange that multiplying by 16 yields 3 TB/s. In my experiments, I only achieved around 3 TB/s when a very large number of SMs were running concurrently, in fact only when all 132 SMs were fully utilized. According to your calculation, wouldn’t we also need to multiply by the number of clusters? That would make the theoretical value far too high, well beyond 3 TB/s.


It is not an individual transfer speed or bandwidth, but the sum over several SMs. If you compare it to the (non-cluster) shared memory speeds, you get similar numbers.

Can you describe your experiments, please? How did you measure and calculate?

Have you added up all the data transferred by each SM? If so, would the speed then only be similar to the L2 speed?
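
To illustrate the kind of bookkeeping I mean (a sketch under my own assumptions; the kernel name dsm_copy_kernel and all sizes are placeholders): time one launch with events, credit every block for the bytes it moved over the SM-to-SM network, and divide the grand total by the elapsed time.

```cpp
// Sketch: aggregate SM-to-SM throughput = (bytes moved by all blocks) / time.
#include <cuda_runtime.h>
#include <cstdio>

float measure_aggregate_dsm_bw(dim3 grid, dim3 block,
                               size_t bytes_per_block_per_iter, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // dsm_copy_kernel<<<grid, block>>>(...);   // hypothetical benchmark kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    // Sum over *all* blocks on the device, not one SM-to-SM link.
    double total_bytes = double(grid.x) * double(bytes_per_block_per_iter) * iters;
    double tb_per_s    = total_bytes / (ms * 1e-3) / 1e12;
    printf("aggregate DSM throughput: %.2f TB/s\n", tb_per_s);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return static_cast<float>(tb_per_s);
}
```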

I have tested it myself: on the full H100 (132 SMs) I get about 3 TB/s. But the theoretical computation here already gives ~3 TB/s for just 16 SMs. Strange, right?