I am trying to understand the graph in the Best Practices Guide which shows the effective bandwidth vs. offset for the example below (again from the Best Practices Guide):
__global__ void offsetCopy(float *odata, float* idata, int offset) {
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}
The best case (fully coalesced access) is when offset is 0 or a multiple of 16. I don’t quite understand how the effective bandwidth is calculated to be 60 GB/sec for an FX5600 and 120 GB/sec for a GTX 280 in this case. Also, for the worst case, the effective bandwidth is given as 6.6 GB/sec for the Quadro FX5600 and 66 GB/sec for the GTX 280. How are these numbers arrived at?
Does this include the load from idata and store to odata?
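For context, the Best Practices Guide defines effective bandwidth as (bytes read + bytes written) / 10^9 / elapsed time in seconds, so for a copy kernel like this one both the load and the store would be counted. A minimal host-side sketch of that arithmetic follows. The function name and its parameters are hypothetical, and the elapsed time is assumed to have been measured separately (for example with CUDA events):

```c
#include <stddef.h>

/* Hypothetical helper: effective bandwidth in GB/s for a copy kernel.
 * Assumes every element is read once and written once, so the byte
 * count is doubled. elapsed_seconds is assumed to be a measured
 * kernel time supplied by the caller. */
double effective_bandwidth_gbs(size_t n_elements,
                               size_t bytes_per_element,
                               double elapsed_seconds)
{
    double bytes_read    = (double)n_elements * (double)bytes_per_element;
    double bytes_written = bytes_read;  /* one store per load */
    return (bytes_read + bytes_written) / 1e9 / elapsed_seconds;
}
```

As an illustration with made-up numbers (not measurements from the guide): copying 1,000,000 floats in 1 ms moves 4 MB in and 4 MB out, which gives effective_bandwidth_gbs(1000000, sizeof(float), 0.001) = 8 GB/sec.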