I have quite a basic question I’d like to clear up.
Imagine you have the following loop:
int a = global[0];
int b = 0;
for (int i = 0; i < W; ++i) {
    a = b;
    b = global[i+1];   // global memory read issued here
    for (int j = 0; j < H; ++j) {
        // do some calculations with a only
    }
}
The question is: will the pipeline stall at any moment to finish reading b (inside the loop)? Assume that the loop on j is long enough to give time for a full global memory read.
There are two levels of interleaving that apply here: 1) while one warp is waiting for b to be read, other warps will be running inside the loop; 2) even within a warp, execution will not stall on the global read of b until the value is actually used.
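To make point 2 concrete, here is a minimal sketch of that software-prefetching pattern as a kernel. The names `global`, `W`, and `H` come from the original post; the kernel signature, `out`, and the inner-loop arithmetic are illustrative assumptions:

```cuda
// Sketch: the read of `b` is issued early, and the value is not consumed
// until the top of the next outer iteration, so the long inner loop can
// execute while the load is in flight.
__global__ void prefetch_demo(const int *global, int *out, int W, int H)
{
    int a = 0;
    int b = global[0];            // first read issued up front
    int acc = 0;
    for (int i = 0; i < W; ++i) {
        a = b;                    // consume the previously issued read
        b = global[i + 1];        // issue the next read; no stall yet
        for (int j = 0; j < H; ++j)
            acc += a * j;         // inner loop hides the read latency
    }
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}
```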
I hope you really aren’t accessing global[i+1] inside a loop in a thread, the uncoalesced accesses will hurt badly.
Is that also true on GT200, where partially uncoalesced reads are handled by the hardware? How would you rewrite such a read to avoid uncoalesced accesses?
You should load the data into a shared memory array of size W+1 and read it from global memory once, each thread reading a different gmem location;
then all threads can access the data from shared memory instead of global memory.
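A sketch of that staging idea, assuming W+1 ints fit in shared memory; the kernel name, `out`, and the inner-loop work are placeholders:

```cuda
// Cooperatively stage global[0..W] into shared memory with coalesced
// reads, then do all per-iteration reads out of shared memory.
__global__ void staged_read(const int *global, int *out, int W, int H)
{
    extern __shared__ int s[];            // W+1 ints, sized at launch
    for (int t = threadIdx.x; t < W + 1; t += blockDim.x)
        s[t] = global[t];                 // thread t loads element t
    __syncthreads();                      // wait until the load finishes

    int acc = 0;
    for (int i = 0; i < W; ++i) {
        int a = s[i];                     // shared read, no gmem traffic
        for (int j = 0; j < H; ++j)
            acc += a * j;
    }
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}
```

Launched as `staged_read<<<blocks, threads, (W + 1) * sizeof(int)>>>(...)` so the dynamic shared array gets the right size.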
Yes. Depending on the memory access pattern, even a slight deviation from fully coalesced can mean a factor of 2 or more performance decrease. GT200 just reduced the cost of uncoalesced reads; it didn't remove the performance hit entirely.
It depends on the intended memory access pattern. The OP was unclear as to what multiple threads are doing. E.g., if each thread needs its own global list, then the best layout is a pitched 2D array of width Nthreads and height W. Coalesced reads are then global[pitch * i + (threadIdx.x + blockIdx.x*blockDim.x)].
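A sketch of that per-thread-list layout; the kernel name and `out` are illustrative, and `pitch` is taken to be in elements rather than bytes:

```cuda
// Element i of each thread's list sits in row i, column = global thread
// index, so at any given i a warp's reads are contiguous (coalesced).
__global__ void per_thread_lists(const int *global, size_t pitch,
                                 int *out, int W)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int acc = 0;
    for (int i = 0; i < W; ++i)
        acc += global[pitch * i + tid];   // coalesced across the warp
    out[tid] = acc;
}
```

Note that cudaMallocPitch returns the pitch in bytes, so divide by sizeof(int) before indexing this way.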
If for some reason the data must be stored in a transposed format, then a 2D texture read could be used (which will be sweet in CUDA 2.2 as you can do this straight from device memory instead of needing a copy to a cudaArray).
If all threads are reading from the same global array with the access pattern in the OP, then constant memory is best (if the array fits). After constant memory, caching in shared memory with a sliding window of coalesced reads would be next best.
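The constant-memory variant could look like the sketch below, assuming the whole array fits in the 64 KB constant space; `MAX_W`, the kernel name, and `out` are illustrative:

```cuda
#define MAX_W 1024                       // illustrative compile-time cap
__constant__ int c_global[MAX_W + 1];
// Host side: cudaMemcpyToSymbol(c_global, h_data, (W + 1) * sizeof(int));

__global__ void const_read(int *out, int W, int H)
{
    int acc = 0;
    for (int i = 0; i < W; ++i) {
        int a = c_global[i];   // every thread reads the same address, so
                               // the constant cache broadcasts the value
        for (int j = 0; j < H; ++j)
            acc += a * j;
    }
    out[threadIdx.x + blockIdx.x * blockDim.x] = acc;
}
```

The broadcast is the key property: constant memory is only fast when all threads of a warp read the same address, which is exactly the access pattern in the OP.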