Can anyone please clarify whether distributed shared memory, introduced with the Hopper architecture, always performs better than a plain shared memory implementation?
Accessing the distributed shared memory of a different thread block will be slower than accessing the local shared memory.
In the memory hierarchy, distributed shared memory sits between local shared memory and global memory.
In terms of performance: implementation using shared memory > using distributed shared memory > using global memory.
So is this the normal behaviour, or are there exceptions to this?
The access speed behaves like that, yes.
Shared memory (L1 cache) is faster than distributed shared memory (which also uses the L1 data path, but not L2, according to this post) and global memory.
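For reference, here is a minimal sketch of how one thread block reads another block's shared memory through the thread block cluster API. The kernel and variable names are my own, and it assumes CUDA 12+ compiled for sm_90:

```cuda
// Minimal sketch (illustrative names): each block in a 2-block cluster
// fills its own shared memory, then reads its neighbour's copy via DSMEM.
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int *out)
{
    __shared__ int smem[32];
    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();

    // Fill the block-local shared memory.
    if (threadIdx.x < 32)
        smem[threadIdx.x] = rank * 1000 + threadIdx.x;

    // Make every block's shared memory visible cluster-wide.
    cluster.sync();

    // Map the same shared-memory variable in the other block of the cluster.
    const unsigned int peer = (rank + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(smem, peer);

    if (threadIdx.x < 32)
        out[blockIdx.x * 32 + threadIdx.x] = remote[threadIdx.x];

    // Keep shared memory alive until all remote reads have finished.
    cluster.sync();
}
```

Launched as, e.g., `dsmem_exchange<<<2, 32>>>(d_out)`; the grid size must be a multiple of the cluster size.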
I am not sure what you mean by implementation. What is your use case?
But theoretically, flushing the current data and then loading the elements again from global memory is more expensive than getting them remotely from the shared memory of another SM, right? Then how come shared memory is faster than distributed shared memory?
Well, accessing data that is local to the SM you are executing on is faster than accessing remote data that resides in another SM.
It also depends on many factors:
- the size of your data,
- the access pattern,
- the needed memory bandwidth compared to computation,
- reuse vs. one-time use,
- how the data is distributed across SMs for distributed shared memory (e.g. one block holds the data or all hold it; one reads or all read),
- the required SM synchronization,
- whether the SM to read from follows a fixed pattern or is determined dynamically.
So please expect only rules-of-thumb advice, and look for the bottleneck in your kernel.
Even with your requirements defined, there are often several ways to implement them.
The paper “Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis” by Luo et al. provides microbenchmarks for distributed shared memory in Section 7, reporting latency and throughput for different access patterns.
Amongst others, it gives the following numbers:
- Local shared memory latency: 29 cycles
- Remote shared memory latency with a cluster size of 2: 181 cycles
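To get a feeling for such numbers, a latency test can be written along these lines. This is a rough sketch assuming the cooperative groups cluster API, not the actual code from the paper:

```cuda
// Rough sketch of a shared-memory latency test (illustrative, not the
// paper's code): thread 0 of block 0 chases a pointer chain either in its
// own shared memory or in block 1's, so every load depends on the previous.
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int N = 1024;

__global__ void __cluster_dims__(2, 1, 1) smem_latency(long long *result,
                                                       int use_remote)
{
    __shared__ unsigned int chain[N];
    cg::cluster_group cluster = cg::this_cluster();

    // Every block builds the same simple chase chain: i -> i + 1 -> ...
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        chain[i] = (i + 1) % N;
    cluster.sync();

    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        unsigned int *p = use_remote ? cluster.map_shared_rank(chain, 1)
                                     : chain;
        unsigned int idx = 0;
        long long start = clock64();
        #pragma unroll 1
        for (int i = 0; i < N; ++i)
            idx = p[idx];            // dependent loads expose raw latency
        long long stop = clock64();
        result[0] = stop - start;    // divide by N on the host for cycles/load
        result[1] = idx;             // keep the chain from being optimized away
    }

    // Block 1 must stay alive while block 0 reads its shared memory.
    cluster.sync();
}
```

Running it once with `use_remote = 0` and once with `use_remote = 1` gives the local vs. remote per-load latency in cycles.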