Why does the shared vs. global memory speedup degrade?

Hi, this is a general question. Shared memory is certainly faster than global memory. Suppose the same code is run and timed alternately with each of the two memories: what are the possible explanations for the smem vs. gmem speedup decreasing as the amount of data used becomes very large?
Thanks for quick replies!

Smem should have a higher bandwidth than gmem (how much higher depends on how wide your gmem interface is). In addition, gmem has a lot of latency (200-300 core cycles, if I remember correctly; it is written somewhere in the CUDA Programming Guide). If you only fetch a small amount of data from global memory, you see this latency in full and it dominates your timings.

If you do have a lot of data to fetch (ideally coalesced), the board can hide that latency by doing some calculations or by issuing the next fetch to memory in the meantime. You will still pay those 200-300 cycles somewhere (you have to wait for the first fetch to complete), but this cost stays almost constant regardless of how much data you use, because once the first transfer is done you can get a chunk of data every core cycle. So the more data you process, the smaller the share of the gmem time that is latency, and the smaller the measured smem vs. gmem speedup becomes.
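To put rough numbers on this, here is a toy cost model in Python (not a benchmark, and not CUDA code). It assumes the latency is paid once and that data then streams at a fixed rate. The 250-cycle gmem latency is just the midpoint of the 200-300 figure above; the 4-cycle smem latency and the equal one-chunk-per-cycle rate for both memories are placeholders I made up for illustration.

```python
# Toy cost model: cycles to read n chunks = startup latency + n / throughput.
# All numbers are illustrative assumptions, not measurements.

GMEM_LATENCY = 250    # cycles; midpoint of the 200-300 figure quoted above
SMEM_LATENCY = 4      # cycles; hypothetical placeholder value
CHUNKS_PER_CYCLE = 1  # assumed identical for both memories once streaming


def cycles(n_chunks, latency, chunks_per_cycle=CHUNKS_PER_CYCLE):
    """Cycles to stream n_chunks when the startup latency is paid only once."""
    return latency + n_chunks / chunks_per_cycle


def speedup(n_chunks):
    """Modelled smem vs. gmem speedup when reading n_chunks."""
    return cycles(n_chunks, GMEM_LATENCY) / cycles(n_chunks, SMEM_LATENCY)


if __name__ == "__main__":
    for n in (1, 100, 10_000, 1_000_000):
        print(f"{n:>9} chunks: modelled speedup ~ {speedup(n):.2f}x")
```

With these made-up numbers the modelled speedup starts at about 50x for a single chunk and shrinks steadily as the data size grows. It tends toward 1x here only because I assumed the same streaming rate for both memories; if smem really has the higher bandwidth, the ratio would level off at the bandwidth ratio instead.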