Reproducing strided memory access benchmark

szymon.ozog · October 25, 2024, 1:50pm

I’ve watched the lecture on CUDA by Stephen Jones (https://www.youtube.com/watch?v=QQceTDjA4f4&t=689s) and I’m trying to reproduce his strided memory read benchmark. I essentially do this by running a vector copy kernel:

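__global__ void copy(int n, const float* in, float* out, int stride)
{
  // Widen to 64 bits before multiplying so large strides cannot overflow.
  // n is unused here; the launch must keep i within both buffers.
  size_t i = (static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x) * stride;
  out[i] = in[i];
}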
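For anyone who wants to repeat the measurement, a host-side timing loop for a kernel like this could be sketched as below. This is not the code from the talk or from this post: `strided_copy`, `bytes_between_reads`, `sweep`, and the `CUDA_CHECK` macro are hypothetical names, and the bounds check, the power-of-two stride sweep, and the bandwidth formula (requested bytes only, read plus write) are assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                          \
  do {                                                            \
    cudaError_t err_ = (call);                                    \
    if (err_ != cudaSuccess) {                                    \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                   cudaGetErrorString(err_));                     \
      std::exit(EXIT_FAILURE);                                    \
    }                                                             \
  } while (0)

// Strided copy with 64-bit indexing; `threads` is the number of active threads.
__global__ void strided_copy(size_t threads, const float* in, float* out, int stride)
{
  size_t t = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (t < threads) out[t * stride] = in[t * stride];
}

// Byte distance between the addresses touched by neighbouring threads.
inline size_t bytes_between_reads(size_t elem_size, int stride)
{
  return elem_size * static_cast<size_t>(stride);
}

// Hypothetical sweep: times the kernel for power-of-two strides and prints the
// effective bandwidth, counting only the bytes the threads actually requested.
void sweep(size_t threads, int block_size, int max_stride, int reps)
{
  float *in = nullptr, *out = nullptr;
  size_t elems = threads * static_cast<size_t>(max_stride);
  CUDA_CHECK(cudaMalloc(&in, elems * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&out, elems * sizeof(float)));
  CUDA_CHECK(cudaMemset(in, 0, elems * sizeof(float)));

  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  unsigned grid = static_cast<unsigned>((threads + block_size - 1) / block_size);

  for (int stride = 1; stride <= max_stride; stride *= 2) {
    strided_copy<<<grid, block_size>>>(threads, in, out, stride);  // warm-up launch
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(start));
    for (int r = 0; r < reps; ++r)
      strided_copy<<<grid, block_size>>>(threads, in, out, stride);
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    // One read plus one write of sizeof(float) per thread per repetition.
    double gbps = 2.0 * threads * sizeof(float) * reps / (ms * 1e-3) / 1e9;
    std::printf("stride %6zu B  %8.2f GB/s\n",
                bytes_between_reads(sizeof(float), stride), gbps);
  }

  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  CUDA_CHECK(cudaFree(in));
  CUDA_CHECK(cudaFree(out));
}
```

Calling `sweep(1 << 22, 256, 64, 10)` from `main` would allocate two 1 GiB buffers and cover byte strides from 4 to 256. Because the copy also writes with the same stride, the numbers describe read and write traffic together, not reads alone.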

I’m getting very similar results, but for me the first plateau starts at 128 bytes between reads rather than 64. I’m testing on the same GPU, so this seems weird to me. I double-checked, and the burst size is 64 bytes, as mentioned in the talk. Does anyone have insight into why this is happening?
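To make the 64-versus-128 question concrete, here is a purely arithmetic sketch (host code only) of how many aligned chunks one warp touches for a given byte stride. It is a simplified model, not a description of the actual memory system: the function names are hypothetical, and the `granule` size is an input you assume, not something the code measures.

```cuda
#include <cstddef>
#include <set>

// Number of distinct `granule`-byte aligned chunks touched when the 32 threads
// of one warp each read `elem_size` bytes, `stride_bytes` apart, starting at 0.
inline std::size_t granules_per_warp(std::size_t elem_size,
                                     std::size_t stride_bytes,
                                     std::size_t granule)
{
  std::set<std::size_t> touched;
  for (std::size_t lane = 0; lane < 32; ++lane) {
    std::size_t first = lane * stride_bytes;   // first byte read by this lane
    std::size_t last = first + elem_size - 1;  // last byte read by this lane
    for (std::size_t g = first / granule; g <= last / granule; ++g)
      touched.insert(g);
  }
  return touched.size();
}

// Fraction of the fetched bytes that the warp actually asked for.
inline double useful_fraction(std::size_t elem_size,
                              std::size_t stride_bytes,
                              std::size_t granule)
{
  return 32.0 * elem_size /
         (granules_per_warp(elem_size, stride_bytes, granule) * granule);
}
```

In this model the useful fraction stops falling once the stride reaches the granule size: with an assumed 64-byte granule, 4-byte reads give 1/16 at both 64 B and 128 B strides, while with an assumed 128-byte granule they give 1/16 at 64 B and 1/32 at 128 B. So the stride at which a plateau begins hints at the effective fetch granularity of whatever level of the memory system dominates the measurement.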

Here are his results:

Here are my results for different block sizes:

Curefab · October 25, 2024, 2:14pm

You are copying float, but the chart says 8-byte reads?

szymon.ozog · October 25, 2024, 2:47pm

Oh, the chart with the black background is from the presentation; the one with the white background is mine. Also, the axis shows the stride between successive reads, not how many bytes are read.

Curefab · October 25, 2024, 3:17pm

Yes, the axis is the stride; I was referring to the subtitle.
Perhaps you should also measure 8-byte accesses (float2, double or long long int instead of float) to rule that out as the reason for the different results.
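One way to try the 8-byte variants without duplicating the kernel is to template it on the element type. This is a minimal sketch, not the original poster's code; `strided_copy_t` and `stride_for_bytes` are hypothetical names, and the bounds check on `threads` is an added assumption.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// The 8-byte candidates mentioned above all have the same size.
static_assert(sizeof(float2) == 8 && sizeof(double) == 8 && sizeof(long long) == 8,
              "expected 8-byte element types");

// Strided copy for an arbitrary element type T (float, float2, double, ...).
template <typename T>
__global__ void strided_copy_t(size_t threads, const T* in, T* out, int stride)
{
  size_t t = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (t < threads) out[t * stride] = in[t * stride];
}

// Element stride that gives the requested byte distance between neighbouring
// threads, so runs with different T can be compared at the same byte stride.
template <typename T>
constexpr int stride_for_bytes(std::size_t bytes)
{
  return static_cast<int>(bytes / sizeof(T));
}
```

For example, `strided_copy_t<double><<<grid, block>>>(threads, in, out, stride_for_bytes<double>(128))` places neighbouring threads 128 bytes apart, the same byte distance as a `float` run with an element stride of 32.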

szymon.ozog · October 25, 2024, 3:29pm

I tried with double; the first plateau also appears at a 128-byte stride.
