Effective bandwidth

I am trying to understand the graph in the Best Practices Guide which shows the effective bandwidth vs. offset for the example below (again from the Best Practices Guide):

__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    // Each thread copies one float, shifted by offset elements.
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}

The best case (fully coalesced) is when offset is 0 or a multiple of 16. I don't quite understand how the effective bandwidth for this case is calculated to be 60 GB/sec for the Quadro FX 5600 and 120 GB/sec for the GTX 280. For the worst case, the effective bandwidth is given as 6.6 GB/sec for the Quadro and 66 GB/sec for the GTX 280. How are these numbers arrived at?

Do these figures include both the load from idata and the store to odata?
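
For reference, this is how I currently read the guide's definition of effective bandwidth: (bytes read + bytes written) / 10^9, divided by the elapsed time in seconds. The sketch below is only my interpretation. The function names, the thread count, and the timing are placeholders I made up, not values from the guide, and the code is plain host-side C++ that should also compile inside a .cu file.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s, following my reading of the guide:
// ((bytes read + bytes written) / 1e9) / elapsed seconds.
double effectiveBandwidthGBs(std::size_t bytesRead,
                             std::size_t bytesWritten,
                             double seconds)
{
    assert(seconds > 0.0);  // a non-positive time means the measurement is broken
    return (static_cast<double>(bytesRead) +
            static_cast<double>(bytesWritten)) / 1e9 / seconds;
}

// Hypothetical helper for offsetCopy: each thread loads one float from
// idata and stores one float to odata, so traffic is counted both ways.
double offsetCopyBandwidthGBs(std::size_t numThreads, double seconds)
{
    const std::size_t bytesEachWay = numThreads * sizeof(float);
    return effectiveBandwidthGBs(bytesEachWay, bytesEachWay, seconds);
}
```

With made-up numbers, 1,000,000 threads and a measured kernel time of 1 ms would give (4 MB + 4 MB) / 10^9 / 0.001 s = 8 GB/sec. If the load and the store are not both counted, the result would be half of that, which is why I am asking.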