I also decided that my DRAM knowledge is somewhat dated, and I removed most of those details from my previous post. SDRAM can achieve nearly its peak theoretical bandwidth for perfectly optimal access patterns (which, among other things, require no bus turnarounds), but I’m not familiar enough with how the memory segments seen by the memory controller map to actual banks within SDRAM partitions to say what achieving this would require from a CUDA memory-address standpoint. It may be trivial, or it may be nearly impossible in practice; I’m not sure.
I also believe SDRAM still suffers from some refresh overhead, though this is potentially quite small (~0.4%).
Nevertheless, observed CUDA memory bandwidths usually incur some penalty against the theoretical maximum, roughly on the order of 5–25%. I know of no similar reduction for shared memory.
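If you want to see this penalty on your own GPU, here is a minimal sketch of the kind of measurement I have in mind: it times a simple device-to-device copy kernel and compares the achieved bandwidth against the theoretical peak computed from the device properties. The kernel name, buffer sizes, and repetition count are my own choices for illustration, not anything canonical; note also that `memoryClockRate` is reported in kHz and the factor of 2 assumes a double-data-rate interface, which may not describe every memory technology the same way.

```cuda
// bw_test.cu — compare measured copy bandwidth against theoretical peak.
// Build: nvcc -O3 bw_test.cu -o bw_test
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Theoretical peak: clock (kHz -> Hz) * 2 transfers/clock (DDR
    // assumption) * bus width in bytes.
    double peak_GBs = 2.0 * prop.memoryClockRate * 1e3 *
                      (prop.memoryBusWidth / 8.0) / 1e9;

    const size_t n = 1 << 26;  // 64M floats: 256 MiB read + 256 MiB write
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((unsigned)((n + 255) / 256));
    copy_kernel<<<grid, block>>>(in, out, n);  // warm-up launch
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int reps = 20;
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        copy_kernel<<<grid, block>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each repetition reads n floats and writes n floats.
    double measured_GBs =
        2.0 * n * sizeof(float) * reps / (ms * 1e-3) / 1e9;
    printf("theoretical peak: %.1f GB/s\n", peak_GBs);
    printf("measured copy:    %.1f GB/s (%.1f%% of peak)\n",
           measured_GBs, 100.0 * measured_GBs / peak_GBs);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

On the GPUs I have tried this style of test on, the copy figure typically lands within the 5–25% penalty range mentioned above, though the exact gap depends on the GPU, the buffer size, and the access pattern.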