I am a newbie to CUDA and am currently reviewing the CUDA SDK examples.
Could someone explain why the following code sample (from the dwtHaar1D example) reads global memory (`id`, a pointer to `float`) twice, the second time at a block-size stride?
shared[tid] = id[idata];
shared[tid + bdim] = id[idata + bdim];
__syncthreads();
Essentially, only the second read (
shared[tid + bdim] = id[idata + bdim];
) is of interest here; it looks like something more than plain thread access coalescing and the usual "hide latency" trick is being used. Is it somehow related to the multi-bank nature of global memory?