Can anyone please clarify whether distributed shared memory, introduced with the Hopper architecture, always performs better than a plain shared memory implementation?
Accessing the distributed shared memory of a different thread block will be slower than accessing the local shared memory.
In the memory hierarchy, distributed shared memory sits between local shared memory and global memory.
In terms of performance: implementation using shared memory > using distributed shared memory > using global memory.
So is this the normal behaviour, or are there exceptions to this?
The access speed behaves like that, yes.
Shared memory (L1 cache) is faster than distributed shared memory (which also uses the L1 data path, but not L2, according to this post) and global memory.
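For reference, here is a minimal sketch of how one thread block reads another block's shared memory through the thread block cluster API. The kernel and variable names are my own, and it assumes CUDA 12+ compiled for sm_90:

```cuda
// Minimal sketch (illustrative names): each block in a 2-block cluster
// fills its own shared memory, then reads its neighbour's copy via DSMEM.
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) dsmem_exchange(int *out)
{
    __shared__ int smem[32];
    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();

    // Fill the block-local shared memory.
    if (threadIdx.x < 32)
        smem[threadIdx.x] = rank * 1000 + threadIdx.x;

    // Make every block's shared memory visible cluster-wide.
    cluster.sync();

    // Map the same shared-memory variable in the other block of the cluster.
    const unsigned int peer = (rank + 1) % cluster.num_blocks();
    int *remote = cluster.map_shared_rank(smem, peer);

    if (threadIdx.x < 32)
        out[blockIdx.x * 32 + threadIdx.x] = remote[threadIdx.x];

    // Keep shared memory alive until all remote reads have finished.
    cluster.sync();
}
```

Launched as, e.g., `dsmem_exchange<<<2, 32>>>(d_out)`; the grid size must be a multiple of the cluster size.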
I am not sure what you mean by implementation. What is your use case?
But theoretically, flushing the current data and then loading the elements again from global memory is more expensive than getting them remotely from the shared memory of another SM, right? Then how come shared memory is faster than distributed shared memory?
Well, accessing data that is local to the SM you are executing on is faster than accessing remote data that resides in another SM.
It also depends on many factors:
- the size of your data,
- the access pattern,
- the needed memory bandwidth compared to computation,
- reuse vs. one-time use,
- how the data is distributed across SMs for distributed shared memory (e.g. one block holds the data or all hold it; one reads or all read),
- the required SM synchronization,
- whether the SM to read from follows a fixed pattern or is determined dynamically.
So please expect only rules-of-thumb advice, and look for the bottleneck in your kernel.
Even with your requirements defined, there are often several ways to implement them.
The paper “Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis” by Luo et al. provides microbenchmarks for distributed shared memory in Section 7, reporting latency and throughput for different access patterns.
Amongst others, it gives the following numbers:
- Local shared memory latency: 29 cycles
- Remote shared memory latency with a cluster size of 2: 181 cycles
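To get a feeling for such numbers, a latency test can be written along these lines. This is a rough sketch assuming the cooperative groups cluster API, not the actual code from the paper:

```cuda
// Rough sketch of a shared-memory latency test (illustrative, not the
// paper's code): thread 0 of block 0 chases a pointer chain either in its
// own shared memory or in block 1's, so every load depends on the previous.
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int N = 1024;

__global__ void __cluster_dims__(2, 1, 1) smem_latency(long long *result,
                                                       int use_remote)
{
    __shared__ unsigned int chain[N];
    cg::cluster_group cluster = cg::this_cluster();

    // Every block builds the same simple chase chain: i -> i + 1 -> ...
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        chain[i] = (i + 1) % N;
    cluster.sync();

    if (cluster.block_rank() == 0 && threadIdx.x == 0) {
        unsigned int *p = use_remote ? cluster.map_shared_rank(chain, 1)
                                     : chain;
        unsigned int idx = 0;
        long long start = clock64();
        #pragma unroll 1
        for (int i = 0; i < N; ++i)
            idx = p[idx];            // dependent loads expose raw latency
        long long stop = clock64();
        result[0] = stop - start;    // divide by N on the host for cycles/load
        result[1] = idx;             // keep the chain from being optimized away
    }

    // Block 1 must stay alive while block 0 reads its shared memory.
    cluster.sync();
}
```

Running it once with `use_remote = 0` and once with `use_remote = 1` gives the local vs. remote per-load latency in cycles.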