I’ve watched Stephen Jones’s lecture on CUDA (https://www.youtube.com/watch?v=QQceTDjA4f4&t=689s) and I’m trying to reproduce his strided memory read benchmark. I do this by running a vector copy kernel:
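__global__ void copy(int n, float* in, float* out, int stride)
{
    // stride is in float elements, so neighbouring threads touch addresses
    // stride * sizeof(float) bytes apart. Note that n is not used in the body.
    unsigned long i = (blockIdx.x*blockDim.x + threadIdx.x)*stride;
    out[i] = in[i];  // both the load and the store use the strided index
}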
I’m getting very similar results, but for me the first plateau starts at 128 bytes between reads rather than 64. I’m testing on the same GPU, so this seems odd to me. I double-checked, and the burst size is 64 bytes, as mentioned in the talk. Does anyone have insight into why this is happening?
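For reference, here is a minimal sketch of how one point of such a measurement could be timed. It is not the exact harness behind the plots. The names `copy_strided`, `bytes_between_reads` and `time_strided_copy` are hypothetical, and the bounds check and 64-bit index arithmetic are additions that the kernel above does not have. It assumes `stride >= 1` and that "bytes between reads" means `stride * sizeof(float)`, so a 64-byte spacing corresponds to a stride of 16 elements.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Same access pattern as the kernel above; the bounds check and the 64-bit
// index computation are assumptions added for safety, not the original code.
__global__ void copy_strided(size_t n, const float* in, float* out, size_t stride)
{
    size_t i = (static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Distance in bytes between the addresses accessed by two neighbouring threads.
__host__ __device__ inline size_t bytes_between_reads(size_t stride)
{
    return stride * sizeof(float);
}

// Hypothetical helper: times one launch with CUDA events.
// Returns elapsed milliseconds, or -1.0f if something failed.
float time_strided_copy(size_t threads, size_t stride, int block_size)
{
    size_t n = threads * stride;  // array length needed so every thread is in range
    float* in = nullptr;
    float* out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) { cudaFree(in); return -1.0f; }
    cudaMemset(in, 0, n * sizeof(float));

    unsigned int blocks = static_cast<unsigned int>((threads + block_size - 1) / block_size);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    copy_strided<<<blocks, block_size>>>(n, in, out, stride);  // warm-up launch
    cudaEventRecord(start);
    copy_strided<<<blocks, block_size>>>(n, in, out, stride);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ok ? ms : -1.0f;
}
```

Calling `time_strided_copy(threads, stride, block_size)` for a range of strides and plotting the result against `bytes_between_reads(stride)` gives a curve of the same kind. Labelling the x axis in elements instead of bytes, or the other way round, would shift where the plateau appears to start.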
Here are his results:
Here are my results for different block sizes: