Reproducing strided memory access benchmark

I watched Stephen Jones's lecture on CUDA (https://www.youtube.com/watch?v=QQceTDjA4f4&t=689s) and I’m trying to reproduce his strided memory read benchmark. I essentially do this by running a vector copy kernel:

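__global__ void copy(int n, const float* in, float* out, int stride)
{
  // Widen to 64 bits before multiplying so a large stride cannot overflow 32-bit arithmetic.
  size_t i = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * stride;
  // No bounds check (n is unused here): the launch must keep i inside both buffers.
  out[i] = in[i];
}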

I’m getting very similar results, but for me the first plateau starts at a stride of 128 bytes between reads, not 64. I’m testing on the same GPU, so this seems odd to me. I double-checked, and the burst size is 64 bytes, as mentioned in the talk. Does anyone have insight into why this is happening?
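One way to reason about where the plateau should begin is a toy model in which memory is fetched in fixed-size, aligned segments and each thread of a warp reads one element. The sketch below is host-only code and is not a description of the hardware: the names `segmentsTouched` and `efficiency`, the 32-thread warp, and the segment sizes passed in are assumptions made for illustration, and caches are ignored entirely.

```cuda
#include <cstddef>
#include <set>

// Toy model (an assumption, not a hardware description): memory is fetched in
// aligned segments of `segmentBytes`, and each of `threads` threads reads one
// element of `elemBytes` at byte offset t * strideBytes.
// Returns how many distinct segments those reads touch.
inline std::size_t segmentsTouched(std::size_t strideBytes, std::size_t elemBytes,
                                   std::size_t segmentBytes, std::size_t threads = 32)
{
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t first = t * strideBytes;        // first byte read by thread t
        std::size_t last  = first + elemBytes - 1;  // last byte read by thread t
        for (std::size_t s = first / segmentBytes; s <= last / segmentBytes; ++s)
            segments.insert(s);
    }
    return segments.size();
}

// Fraction of the fetched bytes that the threads actually use under the model.
inline double efficiency(std::size_t strideBytes, std::size_t elemBytes,
                         std::size_t segmentBytes, std::size_t threads = 32)
{
    double useful  = static_cast<double>(threads * elemBytes);
    double fetched = static_cast<double>(
        segmentsTouched(strideBytes, elemBytes, segmentBytes, threads) * segmentBytes);
    return useful / fetched;
}
```

In this model the efficiency stops falling once the stride reaches the segment size. With 4-byte elements, `efficiency(64, 4, 64)` and `efficiency(128, 4, 64)` both give 1/16, whereas with 128-byte segments the value still halves between those two strides. A plateau that begins at 128 B would therefore be consistent with an effective fetch granularity of 128 B in this model; nothing in this thread establishes whether that is what the hardware actually does.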

Here are his results:

Here are my results for different block sizes:

You are copying float, but the chart says 8-byte reads?

Oh, the chart with the black background is from the presentation; the one with the white background is mine. Also, the axis is not how many bytes are read but the stride between successive reads.

Yes, the axis is the stride; I was referring to the subtitle.
Perhaps you should also measure 8-byte accesses (float2, double, or long long int instead of float) to make sure that is not the reason for the different results.
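A minimal sketch of how that comparison could be set up: the copy kernel is templated on the element type so that float and double can be timed at the same stride in bytes. The names `stridedCopy`, `strideInElements`, and `timeCopy`, the thread count, and the block size are all hypothetical choices, error checking is omitted for brevity, and a single timed launch is noisy, so this is a starting point and not the benchmark from the talk.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Strided copy for any element type T; `stride` is in elements, as in the
// kernel from the question. The index is widened to size_t before multiplying.
template <typename T>
__global__ void stridedCopy(size_t n, const T* in, T* out, size_t stride)
{
    size_t i = (size_t(blockIdx.x) * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];  // n is the number of elements in each buffer
}

// Convert a stride in bytes into a stride in elements of T.
template <typename T>
constexpr size_t strideInElements(size_t strideBytes) { return strideBytes / sizeof(T); }

// Time one launch (after one warm-up launch) and return milliseconds.
template <typename T>
float timeCopy(size_t threads, size_t strideBytes, int blockSize = 256)
{
    size_t stride = strideInElements<T>(strideBytes);
    size_t n = threads * stride;  // elements spanned by all threads
    T *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(T));
    cudaMalloc(&out, n * sizeof(T));
    cudaMemset(in, 0, n * sizeof(T));

    unsigned blocks = static_cast<unsigned>((threads + blockSize - 1) / blockSize);
    stridedCopy<T><<<blocks, blockSize>>>(n, in, out, stride);  // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    stridedCopy<T><<<blocks, blockSize>>>(n, in, out, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}

int main()
{
    const size_t threads = 1 << 20;  // arbitrary; the largest buffers are 256 MiB each
    for (size_t strideBytes = 8; strideBytes <= 256; strideBytes *= 2) {
        float msF = timeCopy<float>(threads, strideBytes);
        float msD = timeCopy<double>(threads, strideBytes);
        printf("stride %4zu B: float %.3f ms, double %.3f ms\n", strideBytes, msF, msD);
    }
    return 0;
}
```

The loop starts at 8 bytes so that the stride is at least one element for both types. If the two columns flatten out at the same byte stride, the element size is unlikely to explain the difference from the chart in the talk.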

I tried with double; the first plateau also appeared at a 128 B stride.