global memory access approach - not just coalescing?

I am a newbie to CUDA and am currently reviewing the CUDA SDK examples.

Could someone explain why the following code sample (from the dwtHaar1D example) reads global memory (id, a float*) twice, the second time at an offset of one block size?

shared[tid] = id[idata];
shared[tid + bdim] = id[idata + bdim];
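
To make the access pattern concrete, here is a minimal sketch of what the two loads appear to do: each block of bdim threads stages 2 * bdim consecutive floats into shared memory, so every thread issues two loads, and each load is contiguous across the block. This is not the SDK kernel. The names loadTwoPerThread and runTwoLoadDemo, the od output buffer, the launch sizes, and the formula for idata are assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the SDK code): each block of bdim threads stages
// 2 * bdim floats from global memory into shared memory, two loads per thread.
__global__ void loadTwoPerThread(const float *id, float *od)
{
    extern __shared__ float shared[];

    const int tid  = threadIdx.x;
    const int bdim = blockDim.x;
    // Assumed index: start of this block's 2 * bdim chunk, plus the thread id.
    const int idata = blockIdx.x * (2 * bdim) + tid;

    shared[tid]        = id[idata];         // first half of the chunk
    shared[tid + bdim] = id[idata + bdim];  // second half, one block size further on
    __syncthreads();

    // Copy the staged values back out so the host can check them
    // (illustration only; a real kernel would compute on shared[] here).
    od[idata]        = shared[tid];
    od[idata + bdim] = shared[tid + bdim];
}

// Host helper: returns true if every element made the round trip unchanged.
bool runTwoLoadDemo()
{
    const int bdim = 128, blocks = 4, n = 2 * bdim * blocks;  // assumed sizes
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Dynamic shared memory: room for the 2 * bdim floats each block stages.
    loadTwoPerThread<<<blocks, bdim, 2 * bdim * sizeof(float)>>>(d_in, d_out);
    const cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) return false;
    return true;
}
```

Calling runTwoLoadDemo() from main should return true on a machine with a CUDA-capable GPU. In this sketch, threads tid and tid + 1 touch adjacent addresses in both loads, so the second load looks just as coalesced as the first.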


I am mainly interested in the second load, shared[tid + bdim] = id[idata + bdim];, because it looks as if something other than plain coalescing of thread accesses or a "hide latency" trick is being used here. Is it somehow related to the multi-bank nature of global memory?