What is the inter-SM linkage of DSM (cluster)?

Hi! I am wondering how the SMs within a cluster are connected for distributed shared memory (DSM). For example, a thread on SM0 accessing SM0's own shared memory will be fastest, and a thread on SM0 accessing SM1's shared memory will be slower. But what about accessing SM2 or SM3? Is that the same?

The NoC might be a star, a ring, or some combination. Could any details be provided? Thanks!

I don’t believe there are published details to answer your questions. This thread may be of interest, perhaps.

One of the main takeaways from that thread is that when accessing non-local DSM, i.e. shared memory that belongs to another threadblock, the preferred access patterns are similar to the preferred access patterns for global memory, e.g. adjacent/contiguous.
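
For concreteness, here is a minimal sketch of what such a non-local DSM access looks like (my own illustration, assuming sm_90 and a recent CUDA toolkit; the kernel name, array size, and the 2-block cluster are made up for the example):

```cpp
// Minimal sketch: each block fills its own shared memory, then reads the
// peer block's shared memory through the cluster's map_shared_rank().
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int REMOTE_WORDS = 1024;   // illustrative size

__global__ void __cluster_dims__(2, 1, 1) dsm_read(int *out)
{
    __shared__ int smem[REMOTE_WORDS];
    cg::cluster_group cluster = cg::this_cluster();

    // Fill local shared memory with block-specific data.
    for (int i = threadIdx.x; i < REMOTE_WORDS; i += blockDim.x)
        smem[i] = cluster.block_rank() * REMOTE_WORDS + i;

    // Make every block's shared memory visible before remote reads.
    cluster.sync();

    // Map the shared-memory pointer of the other block in this 2-block cluster.
    unsigned int peer = cluster.block_rank() ^ 1;
    int *remote = cluster.map_shared_rank(smem, peer);

    // Coalesced, unit-stride reads over the SM-to-SM fabric.
    int acc = 0;
    for (int i = threadIdx.x; i < REMOTE_WORDS; i += blockDim.x)
        acc += remote[i];

    // Keep all blocks alive until every remote access has completed.
    cluster.sync();

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // one slot per thread
}
```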


For documentation of the recommended access patterns, see the tuning guide (NVIDIA Hopper Tuning Guide); it says the same thing Robert already stated and linked to from the other post:

In order to achieve best performance for accesses to Distributed Shared Memory, access patterns similar to those described in the CUDA C++ Best Practices Guide for Global Memory should be used. Specifically, accesses to Distributed Shared Memory should be coalesced and aligned to 32-byte segments, if possible. Access patterns with non-unit stride should be avoided if possible, which can be achieved by using local shared memory, similar to what is shown in the CUDA C++ Best Practices Guide for Shared Memory.
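
To make the last recommendation concrete, here is a hedged sketch (mine, not from the guide) of staging remote data into local shared memory with coalesced loads, so that any non-unit-stride work stays off the SM-to-SM fabric (TILE and the kernel name are illustrative):

```cpp
// Sketch: coalesced, unit-stride loads over the SM-to-SM fabric into a local
// staging buffer; strided accesses only touch the local copy afterwards.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int TILE = 2048;   // illustrative tile size

__global__ void __cluster_dims__(2, 1, 1) stage_then_stride(float *out, int stride)
{
    __shared__ float remote_src[TILE];   // owned by this block, read by the peer
    __shared__ float local_copy[TILE];   // local staging buffer

    cg::cluster_group cluster = cg::this_cluster();

    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        remote_src[i] = static_cast<float>(i);
    cluster.sync();

    float *peer = cluster.map_shared_rank(remote_src, cluster.block_rank() ^ 1);

    // Coalesced: consecutive threads read consecutive remote addresses.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        local_copy[i] = peer[i];
    __syncthreads();

    // Non-unit-stride access happens on the local copy, not over the fabric.
    float acc = local_copy[(threadIdx.x * stride) % TILE];

    cluster.sync();                      // keep the peer's shared memory alive
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;   // one slot per thread
}
```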

You can find some 3rd-party information about the SM-SM interconnect properties in the paper “Benchmarking and Dissecting the Nvidia Hopper GPU Architecture”: https://arxiv.org/pdf/2402.13499

The gist seems to be that the different SM-SM connections compete for bandwidth resources.

Also, Nvidia specifies that 8 SMs per cluster is the portable setting compatible across architectures, while Hopper specifically supports up to 16 with an opt-in. This seems to be a hint that adding more SMs to a cluster is not overly expensive, so it is probably not a quadratic interconnect effort (i.e. not every SM connected to every other SM at full bandwidth).
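
For reference, cluster sizes above 8 require an explicit opt-in at launch time. A host-side sketch (kernel name, grid and block sizes are placeholders):

```cpp
// Hedged sketch of launching a kernel with a non-portable cluster size of 16.
#include <cuda_runtime.h>

__global__ void cluster_kernel(float *out) { /* ... uses DSM ... */ }

void launch_with_cluster_16(float *d_out, cudaStream_t stream)
{
    // Opt in to non-portable cluster sizes (anything above 8).
    cudaFuncSetAttribute(cluster_kernel,
                         cudaFuncAttributeNonPortableClusterSizeAllowed, 1);

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(128, 1, 1);   // must be a multiple of the cluster size
    cfg.blockDim = dim3(256, 1, 1);
    cfg.stream   = stream;

    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeClusterDimension;
    attr[0].val.clusterDim.x = 16;    // 16 blocks -> up to 16 SMs per cluster
    attr[0].val.clusterDim.y = 1;
    attr[0].val.clusterDim.z = 1;
    cfg.attrs    = attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, cluster_kernel, d_out);
}
```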

I would assume (without any specific Hopper knowledge beyond the named sources) some kind of several-parallel-buses or crossbar architecture with 2048 bytes/cycle, comprising 16 independent transfers at the same time (16 * 128 bytes). That would lead to 1755 MHz * 2048 bytes, which roughly fits the 3.27 TB/s from the paper.
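
Written out (the 1755 MHz boost clock is my assumption; the measured 3.27 TB/s would correspond to an effective clock of roughly 1.6 GHz):

$$
16 \times 128\,\mathrm{B} = 2048\,\mathrm{B/cycle},\qquad
2048\,\mathrm{B/cycle} \times 1.755\,\mathrm{GHz} \approx 3.6\,\mathrm{TB/s}
$$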

Those 3.27 TB/s were achieved with a cluster size of 2; however, one can assume (since the memory bandwidth of each individual SM is lower) that they measured 8 clusters of size 2 and summed the results over the 8 clusters and the 2 blocks (= SMs) per cluster. So probably each SM can transmit and/or receive only one 128-byte transfer per cycle at a time. The number 16 derived from the bandwidth fits well with the maximum of 16 SMs per cluster on the H100.

One could test whether the granularity of transfers actually is 128 bytes or 32 bytes.
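
A rough sketch of how one might probe this (my own idea, not a validated benchmark; kernel name and sizes are illustrative): read the peer block's shared memory with a configurable element stride and time it with clock64(). If effective bandwidth stops dropping once the stride exceeds one 32-byte segment (8 floats), the granularity is likely 32 bytes; if it keeps dropping until 128 bytes (32 floats), it is likely 128 bytes.

```cpp
// Probe kernel: timed strided reads of the peer block's shared memory.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int WORDS = 4096;   // 16 KiB of shared memory per block
constexpr int ITERS = 256;

__global__ void __cluster_dims__(2, 1, 1)
granularity_probe(float *sink, long long *cycles, int stride)
{
    __shared__ float smem[WORDS];
    cg::cluster_group cluster = cg::this_cluster();

    for (int i = threadIdx.x; i < WORDS; i += blockDim.x)
        smem[i] = 1.0f;
    cluster.sync();

    float *peer = cluster.map_shared_rank(smem, cluster.block_rank() ^ 1);

    float acc = 0.f;
    long long t0 = clock64();
    for (int it = 0; it < ITERS; ++it) {
        // Larger strides touch fewer elements per hardware transfer segment,
        // wasting more of whatever the real granularity is.
        int idx = (threadIdx.x * stride + it) % WORDS;
        acc += peer[idx];
    }
    long long t1 = clock64();

    cluster.sync();                      // keep the peer's shared memory alive
    atomicAdd(&sink[blockIdx.x], acc);   // prevent dead-code elimination
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;
}
```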

However, Fig. 9 could indicate an even higher throughput for the ‘histogram application’ than the throughput benchmark of Fig. 8 achieves. I presume that, since no unit like TB/s is stated in Fig. 9, it is a relative throughput, and overall less than the 3.27 TB/s. Also, the text gives 3.27 TB/s as the peak throughput and no absolute numbers for Fig. 9.

Distributed Shared Memory. SM-to-SM network latency is 180 cycles, a 32% reduction compared to L2 cache. This validates the advantages of the network, facilitating efficient data exchange from producers to consumers.

In Fig. 8, SM-to-SM throughput is illustrated for varying cluster and block sizes. As typically observed in similar benchmarks, larger block sizes and more parallelizable instructions result in higher throughputs. A peak throughput of nearly 3.27 TB/s is observed with a cluster size of 2, reducing to 2.65 TB/s with a cluster size of 4. Interestingly, as more blocks in the cluster compete for SM-to-SM bandwidth, the overall throughput gets lower and lower. While a larger cluster size can reduce data movement latency for more blocks, it intensifies throughput competition. Balancing this tradeoff by selecting optimal block and cluster sizes is an important direction for exploration.

Fig. 9 displays the histogram throughput with distributed shared memory. First, the optimal cluster size differs for various block sizes (CS=4 for block size 128, CS=2 for block size 512). Increasing block and cluster sizes can saturate SM-to-SM network utilization, potentially degrading overall performance due to resource contention. Second, a notable performance drop occurs from 1024 to 2048 Nbins when CS=1. Larger Nbins demand more shared memory space and limit active block numbers on an SM. Employing the cluster mechanism to divide Nbins within the same cluster enhances block concurrency, mitigating this issue. Lastly, although shared memory is not a limiting factor for active block numbers with block size = 512, choosing an appropriate cluster size eases the on-chip shared memory traffic by leveraging the SM-to-SM network resource, ultimately improving overall performance.
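
For readers who want to see what “dividing Nbins within the same cluster” could look like in practice, here is an illustrative sketch (my own, not the paper's benchmark code): each block owns a contiguous slice of the bins in its shared memory, and other blocks update that slice over the SM-to-SM network with atomics.

```cpp
// Sketch: histogram bins partitioned across the blocks of a cluster (DSM).
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cluster_histogram(int *global_bins, int nbins,
                                  const int *input, int n)
{
    extern __shared__ int local_bins[];          // this block's slice of bins
    cg::cluster_group cluster = cg::this_cluster();
    int cluster_size   = cluster.dim_blocks().x;
    int bins_per_block = nbins / cluster_size;   // assume it divides evenly

    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        local_bins[i] = 0;
    cluster.sync();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int bin   = input[i] % nbins;            // assumes non-negative inputs
        int owner = bin / bins_per_block;        // block that holds this bin
        int *dst  = cluster.map_shared_rank(local_bins, owner);
        atomicAdd(&dst[bin % bins_per_block], 1);
    }
    cluster.sync();                              // all remote updates finished

    // Each block flushes its own slice to the global histogram.
    int base = cluster.block_rank() * bins_per_block;
    for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
        atomicAdd(&global_bins[base + i], local_bins[i]);
}
```

This would be launched with cudaLaunchKernelEx (cluster dimension set to the chosen CS) and bins_per_block * sizeof(int) bytes of dynamic shared memory per block, so a larger cluster size directly reduces the shared-memory footprint per SM, which is the effect the paper describes for the 1024-to-2048 Nbins case.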


Actually, it’s very strange that multiplying by 16 yields 3 TB/s. In my experiments, I only achieved around 3 TB/s when a very large number of SMs were running concurrently, in fact only when all 132 SMs were fully utilized. According to your calculation, wouldn’t we also need to multiply by the number of clusters? That would make the theoretical value far too high, well beyond 3 TB/s.


It is not an individual transfer speed or bandwidth, but the sum over several SMs. If you compare it to the (non-cluster) shared memory speeds, you get similar numbers.

Can you describe your experiments, please? How did you measure and calculate?

Have you added up all the data transferred by each SM? If so, would the speed then only be similar to the L2 speed?
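
To illustrate the kind of bookkeeping I mean (a sketch under my own assumptions; the kernel name dsm_copy_kernel and all sizes are placeholders): time one launch with events, credit every block for the bytes it moved over the SM-to-SM network, and divide the grand total by the elapsed time.

```cpp
// Sketch: aggregate SM-to-SM throughput = (bytes moved by all blocks) / time.
#include <cuda_runtime.h>
#include <cstdio>

float measure_aggregate_dsm_bw(dim3 grid, dim3 block,
                               size_t bytes_per_block_per_iter, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // dsm_copy_kernel<<<grid, block>>>(...);   // hypothetical benchmark kernel
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    // Sum over *all* blocks on the device, not one SM-to-SM link.
    double total_bytes = double(grid.x) * double(bytes_per_block_per_iter) * iters;
    double tb_per_s    = total_bytes / (ms * 1e-3) / 1e12;
    printf("aggregate DSM throughput: %.2f TB/s\n", tb_per_s);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return static_cast<float>(tb_per_s);
}
```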

I have tested it myself: on the full H100 (132 SMs) I get about 3 TB/s. But the theoretical computation here already gives ~3 TB/s for just 16 SMs. Strange, right?