# How do I understand data prefetching with a double buffer?

Hello, I’m inexperienced with CUDA C.
My question is about the matrix multiplication kernel below, which has the following structure: why can the computation be overlapped with the IO, i.e. why does the double buffer actually prefetch data? Coming from C++, my understanding is that a `for` loop executes its body sequentially, so I don’t see where the parallelism comes from. I know that the FFMA units and the Load/Store Units in the GPU can execute instructions in parallel, but I’m having a hard time mapping that onto the CUDA kernel code. Thank you!

```
__global__ void MatrixMulKernel_Shared_Prefetch(float *A, float *B, float *C,
                                                int numARows, int numAColumns,
                                                int numBRows, int numBColumns,
                                                int numCRows, int numCColumns)
{
    // Double buffers: while one half is used for computation, the other
    // half receives the next tile prefetched from global memory.
    __shared__ float sharedM[2][BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float sharedN[2][BLOCK_SIZE][BLOCK_SIZE];
    float sum = 0.0f;
    float gl_a, gl_b;

    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int numTiles = (int)ceil((float)numAColumns / blockDim.x);

    // Prologue: load tile 0 into buffer 0.
    if (threadIdx.x < numAColumns && row < numARows)
        sharedM[0][threadIdx.y][threadIdx.x] = A[row * numAColumns + threadIdx.x];
    else
        sharedM[0][threadIdx.y][threadIdx.x] = 0.0f;

    if (threadIdx.y < numBRows && col < numBColumns)
        sharedN[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * numBColumns + col];
    else
        sharedN[0][threadIdx.y][threadIdx.x] = 0.0f;
    __syncthreads();

    for (int i = 1; i < numTiles; i++)
    {
        // Issue the global-memory loads for tile i into registers.
        if (i * blockDim.x + threadIdx.x < numAColumns && row < numARows)
            gl_a = A[row * numAColumns + i * blockDim.x + threadIdx.x];
        else
            gl_a = 0.0f;

        if (i * blockDim.y + threadIdx.y < numBRows && col < numBColumns)
            gl_b = B[(i * blockDim.y + threadIdx.y) * numBColumns + col];
        else
            gl_b = 0.0f;

        int stride_i = (i + 1) % 2;  // compute flag: buffer holding tile i-1
        int stride_j = i % 2;        // IO flag: buffer receiving tile i

        // Compute on the previously loaded tile while the loads are in flight.
        for (int j = 0; j < blockDim.x; j++)
            sum += sharedM[stride_i][threadIdx.y][j] * sharedN[stride_i][j][threadIdx.x];

        // Commit the prefetched tile to the other buffer.
        sharedM[stride_j][threadIdx.y][threadIdx.x] = gl_a;
        sharedN[stride_j][threadIdx.y][threadIdx.x] = gl_b;
        __syncthreads();
    }

    // Epilogue: compute on the last tile, which sits in buffer (numTiles-1)%2.
    int last = (numTiles - 1) % 2;
    for (int j = 0; j < blockDim.x; j++)
        sum += sharedM[last][threadIdx.y][j] * sharedN[last][j][threadIdx.x];

    if (row < numCRows && col < numCColumns)
        C[row * numCColumns + col] = sum;
}
```
---

That considers things at the warp level. Furthermore, although not connected to the “double buffering” depicted here in the same way, if we consider things at the block level, then the execution behavior of two blocks might be completely decoupled (`__syncthreads()` is a block-level execution barrier; it does not synchronize between blocks). Therefore, between two blocks, one block could be doing any portion of the loading while the other block could be doing any portion of the calculation. These two blocks might even be resident on the same SM.