Why the difference? Did the designers not see the need for higher bandwidth per multiprocessor and make a trade-off? I would certainly like shared memory to be faster, since that would speed up my convolution code. Register blocking would be faster, but it is more complicated.
Shared memory latency is only a few cycles, so a few half-warps (e.g. 64 threads) are enough to hide it, assuming the hardware even bothers switching for such a short operation. In my microbenchmark I ensure there are no bank conflicts and use 512 threads on 30 multiprocessors, which measures 1200 GiB/s aggregate => 40 GiB/s per multiprocessor, which works out to a shared-memory throughput of about 0.5 loads / cycle / bank.
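For reference, here is a back-of-envelope check of that last step. The bank count (16), 4-byte bank word, and ~1.3 GHz shader clock are my assumptions for a GT200-class part, not numbers from the measurement itself:

```python
# Sanity-check the "0.5 loads / cycle / bank" figure from the
# measured 1200 GiB/s aggregate shared-memory bandwidth.
# Assumed hardware parameters (GT200-era): 16 banks, 4-byte bank
# words, ~1.3 GHz shader clock. Swap in your own GPU's values.
GIB = 1024**3

aggregate_bw = 1200 * GIB   # measured aggregate bandwidth, bytes/s
num_sms = 30                # multiprocessors used in the benchmark
banks = 16                  # shared-memory banks per SM (assumed)
bytes_per_bank = 4          # bank word size in bytes (assumed)
clock_hz = 1.3e9            # shader clock (assumed)

per_sm_bw = aggregate_bw / num_sms  # bandwidth per multiprocessor
loads_per_cycle_per_bank = per_sm_bw / (banks * bytes_per_bank * clock_hz)

print(round(per_sm_bw / GIB))               # 40 (GiB/s per SM)
print(round(loads_per_cycle_per_bank, 2))   # roughly 0.5
```

With these assumed parameters the arithmetic lands at about 0.5 loads per cycle per bank, i.e. one load every other cycle rather than the one-per-cycle you might expect from the documentation.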