How do I understand data prefetching with a double buffer?

Hello, I’m new to CUDA C.
One question I have about the matrix multiplication kernel below is why the computation can be overlapped with the I/O, i.e., why a double buffer can be used for data prefetching. Coming from C++, my understanding is that a for loop executes its body sequentially, so I don’t see where the parallelism comes from. I know that the FFMA units and the Load/Store Units in the GPU can execute instructions in parallel, but I’m having a hard time connecting that to the CUDA kernel code. Thank you!

__global__ void MatrixMulKernel_Shared_Prefetch(float *A, float *B, float *C, int numARows, int numAColumns, int numBRows, int numBColumns, int numCRows, int numCColumns)
{
    __shared__ float sharedM[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float sharedN[BLOCK_SIZE][BLOCK_SIZE];
    float sum = 0.0;
    float gl_a, gl_b;

    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    // load the first tiles of A and B into the first half of the shared buffers
    if (col < numAColumns && row < numARows)
        sharedM[threadIdx.y][threadIdx.x] = A[row*numAColumns + threadIdx.x];
    else
        sharedM[threadIdx.y][threadIdx.x] = 0.0;

    if (row < numBRows && col < numBColumns)
        sharedN[threadIdx.y][threadIdx.x] = B[threadIdx.y*numBColumns + col];
    else
        sharedN[threadIdx.y][threadIdx.x] = 0.0;
    __syncthreads();

    for (int i = 1; i < (int)(ceil((float)numAColumns / blockDim.x)); i++)
    {
        // prefetch the next tiles from global memory into registers
        if (i*blockDim.x + threadIdx.x < numAColumns && row < numARows)
            gl_a = A[row*numAColumns + i*blockDim.x + threadIdx.x];
        else
            gl_a = 0.0;

        if (i*blockDim.y + threadIdx.y < numBRows && col < numBColumns)
            gl_b = B[(i*blockDim.y + threadIdx.y)*numBColumns + col];
        else
            gl_b = 0.0;

        int stride_i = (i + 1) % 2;  // compute flag: selects the buffer half to compute from
        int stride_j = i % 2;        // IO flag: selects the buffer half that receives the prefetched tile

        // compute on the tile that is already resident in shared memory
        for (int j = 0; j < blockDim.x; j++)
            sum += sharedM[stride_i*(BLOCK_SIZE/2) + threadIdx.y][j] * sharedN[stride_i*(BLOCK_SIZE/2) + j][threadIdx.x];

        // park the prefetched tile in the other half of the double buffer
        sharedM[stride_j*(BLOCK_SIZE/2) + threadIdx.y][threadIdx.x] = gl_a;
        sharedN[stride_j*(BLOCK_SIZE/2) + threadIdx.y][threadIdx.x] = gl_b;
        __syncthreads();
    }

    // compute on the last tile
    for (int j = 0; j < blockDim.x; j++)
        sum += sharedM[1*(BLOCK_SIZE/2) + threadIdx.y][j] * sharedN[1*(BLOCK_SIZE/2) + j][threadIdx.x];
    __syncthreads();

    if (row < numCRows && col < numCColumns)
        C[row*numCColumns + col] = sum;
}

Not all threads in CUDA execute in lockstep. CUDA has a hierarchical execution (thread) breakdown: the grid at the top level, the thread block below that, the warp below that, and the individual thread below that. The grid (all threads associated with a kernel launch) generally does not execute in lockstep. A thread block generally does not execute in lockstep. Only at the warp level do we see typical (but not guaranteed) lockstep execution of threads.
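As a minimal illustration of that hierarchy (a sketch of my own, not taken from the kernel above; it assumes a 1D block, and warpSize is the built-in CUDA variable, 32 on current NVIDIA GPUs), each thread can work out which block, warp, and lane it belongs to:

#include <cstdio>

// Minimal sketch: identify the grid/block/warp/lane position of each thread.
__global__ void whoAmI()
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x; // position within the grid
    int warpId = threadIdx.x / warpSize;                // which warp inside this block
    int laneId = threadIdx.x % warpSize;                // position inside that warp
    // Threads that share (blockIdx.x, warpId) typically issue instructions together;
    // different warps and different blocks make independent forward progress.
    if (laneId == 0)
        printf("grid thread %d: block %d, warp %d\n", tid, blockIdx.x, warpId);
}

int main()
{
    whoAmI<<<2, 64>>>();   // 2 blocks of 64 threads = 2 warps per block
    cudaDeviceSynchronize();
    return 0;
}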

Therefore, depending on which point in the code we are talking about, it is possible to have some warps still working through the for-loop associated with the calculations, whereas other warps have finished their calculation and gone on ahead to the memory operations that move the prefetched data into shared memory. Some warps may still be doing this:

        for (int j = 0; j < blockDim.x; j++)
            sum += sharedM[stride_i*(BLOCK_SIZE/2) + threadIdx.y][j] * sharedN[stride_i*(BLOCK_SIZE/2) + j][threadIdx.x];
while others have completed that and are doing this:

        sharedM[stride_j*(BLOCK_SIZE/2) + threadIdx.y][threadIdx.x] = gl_a;
        sharedN[stride_j*(BLOCK_SIZE/2) + threadIdx.y][threadIdx.x] = gl_b;
That considers things at the warp level. Furthermore, although it is not connected to the “double buffering” depicted here in the same way, if we consider things at the block level, then the execution behavior of two blocks might be completely disconnected (__syncthreads() is a block-level execution barrier; it does not synchronize between blocks). Therefore, between two blocks, one block could be doing any portion of the loading while another block is doing any portion of the calculations. These two blocks might even be resident on the same SM.
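To make the pattern itself easier to see in isolation, here is a stripped-down toy of the same double-buffering idea, a sketch of my own and not a drop-in replacement for the kernel above (the names tiledSum, TILE, in, out, and n are made up for illustration; it is meant to be launched with a single block of TILE threads). The thing to notice is that the global load for the next tile is issued before the arithmetic loop, and the barrier only separates reuse of each buffer half within the block:

#include <cstdio>

#define TILE 128  // tile width; launch with one block of TILE threads

// Toy double-buffered kernel: each thread accumulates the sum of the whole input
// array, tile by tile, computing on one buffer half while the next tile is prefetched.
__global__ void tiledSum(const float *in, float *out, int n)
{
    __shared__ float buf[2][TILE];   // the two halves of the double buffer
    float sum = 0.0f;
    int cur = 0;

    // preload tile 0 into buffer half 0 (zero-pad past the end of the array)
    buf[cur][threadIdx.x] = (threadIdx.x < n) ? in[threadIdx.x] : 0.0f;
    __syncthreads();

    int numTiles = (n + TILE - 1) / TILE;
    for (int t = 1; t < numTiles; t++)
    {
        int nxt = cur ^ 1;
        int idx = t * TILE + threadIdx.x;

        // issue the global load for the NEXT tile first; it is in flight
        // while the arithmetic loop below works on the CURRENT tile
        float prefetched = (idx < n) ? in[idx] : 0.0f;

        // "compute" phase: every thread reads the whole current tile,
        // analogous to the inner j-loop of the matrix-multiply kernel
        for (int j = 0; j < TILE; j++)
            sum += buf[cur][j];

        // park the prefetched tile in the other buffer half
        buf[nxt][threadIdx.x] = prefetched;

        // block-level barrier: every thread of THIS block must arrive before
        // any of them reuses either buffer half; other blocks are unaffected
        __syncthreads();
        cur = nxt;
    }

    // the last tile is still sitting in buf[cur]
    for (int j = 0; j < TILE; j++)
        sum += buf[cur][j];

    out[threadIdx.x] = sum;   // every thread ends up with the total of all n elements
}

int main()
{
    const int n = 1000;
    float h_in[n], h_out[TILE];
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;   // expected total: 1000
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, TILE * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    tiledSum<<<1, TILE>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, TILE * sizeof(float), cudaMemcpyDeviceToHost);
    printf("thread 0 sees a total of %f\n", h_out[0]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

That is the sense in which the “sequential” for loop still overlaps computation with I/O: the loads are issued early, and the arithmetic that follows them in program order does not have to wait for them until the prefetched values are actually stored into shared memory.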