Shared Memory next to ALU control?

On page 5 of the CUDA Programming Guide,…

applications can take advantage of it by minimizing overfetch and round-trips to DRAM and therefore becoming less dependent on DRAM memory bandwidth?

My questions are:

What could this possibly mean?
What do "overfetch" and "round-trips to DRAM" mean? Why would there be any overfetching? I was also wondering whether it's possible to do prefetching in some scientific CUDA code.

Thanks.

Please don’t post the same question to multiple threads. See the response in the shared memory bandwidth thread.

I was just worried that my question was not closely related to the threads I had read, which is why I posted it twice.

Many thanks for the reminder.