Why does shared memory have lower bandwidth per multiprocessor than global memory?

For a Tesla C1060:

global bandwidth ~= 95 GiB/s

smem bandwidth ~= 40 GiB/s per multiprocessor

Both move the same number of bits per transfer (512 bits = 64 bytes):

smem: 16 banks * 4 bytes/bank
global: 8 channels * 8 bytes/channel

Why the difference? Did the designers not see the need for higher per-multiprocessor bandwidth and make a trade-off? I would definitely like shared memory to be faster, since that would speed up my convolution code. Using register blocking would be faster, but it is more complicated.

I think that if the number of active threads per SM is more than 256, then the latency of shared memory can be hidden.

That would mean the cost of a load/store to a shared-memory address (with no bank conflicts) is the same as the cost of a MAD.
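
To make that concrete, here is a hypothetical inner loop (a made-up kernel, not measured code): every iteration reads one value from shared memory and performs one MAD, so if the two cost the same, the loop runs at half the pure-MAD rate.

// Hypothetical example: one shared-memory read feeds one MAD per
// iteration. s[i] is the same address for every thread in the warp,
// which is a broadcast and therefore conflict-free.
__global__ void smem_mad(const float *in, float *out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;                      // assumes blockDim.x == 256
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    float acc = 0.0f;
    for (int i = 0; i < 256; ++i)
        acc += s[i] * 0.5f;                     // smem read + MAD each iteration

    out[blockIdx.x * blockDim.x + tid] = acc;   // keep the loop live
}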

Using register blocking reduces the number of shared-memory loads/stores, so performance improves.
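
A sketch of the register-blocked version (again hypothetical, and again assuming blockDim.x == 256): each value loaded from shared memory is reused by four MADs into four register accumulators, so smem traffic per MAD drops 4x.

// Hypothetical register-blocked version: four accumulators live in
// registers, and one shared-memory read now feeds four MADs.
__global__ void smem_mad_regblock(const float *in, float *out)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (int i = 0; i < 256; ++i) {
        float v = s[i];        // one broadcast read...
        acc0 += v * 0.1f;      // ...reused by four MADs
        acc1 += v * 0.2f;
        acc2 += v * 0.3f;
        acc3 += v * 0.4f;
    }
    out[blockIdx.x * blockDim.x + tid] = acc0 + acc1 + acc2 + acc3;
}

With the load-to-MAD ratio at 1:4 instead of 1:1, shared-memory bandwidth stops being the bottleneck sooner.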

Shared memory latency is only a few cycles, so a few half-warps (e.g. 64 threads) are enough to hide it, assuming the hardware even bothers switching warps for such a short operation. In my microbenchmark I make sure there are no bank conflicts and use 512 threads on 30 multiprocessors, which gets 1200 GiB/s aggregate => 40 GiB/s per multiprocessor. That means the throughput of shared memory is 0.5 loads/cycle/bank: 40 GiB/s / (16 banks * 4 bytes/bank * ~1.3 GHz shader clock) ~= 0.5.
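
For reference, a stripped-down sketch of the kind of kernel I time (host-side timing, block count, and unrolling omitted; all names are made up). Consecutive threads in a half-warp hit consecutive banks, so there are no conflicts; bandwidth = blocks * threads * ITERS * 4 bytes / elapsed time.

#define ITERS 4096

__global__ void smem_bw(float *out)
{
    __shared__ float s[512];
    int tid = threadIdx.x;                 // assumes 512 threads per block
    s[tid] = (float)tid;
    __syncthreads();

    float acc = 0.0f;
    for (int i = 0; i < ITERS; ++i)
        acc += s[(tid + i) & 511];         // consecutive threads -> consecutive banks

    out[blockIdx.x * blockDim.x + tid] = acc;   // prevent dead-code elimination
}

A real measurement would also unroll the loop so the index arithmetic doesn't dominate.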

Can anyone confirm?