Bank conflict of tiled matrix multiplication

Ushuaia · May 17, 2023, 7:34pm

The following 32*32 tiled matrix multiplication kernel reports shared memory store bank conflicts that I can’t explain:

constexpr int DSIZE = 8192;
constexpr int block_size = 32;

// matrix multiply kernel: C = A * B
__global__ void mmul(const float *A, const float *B, float *C, int ds) {

  // declare cache in shared memory
  __shared__ float As[block_size][block_size];
  __shared__ float Bs[block_size][block_size];

  int idx = threadIdx.x + blockDim.x * blockIdx.x; // create thread x index
  int idy = threadIdx.y + blockDim.y * blockIdx.y; // create thread y index

  if ((idx < ds) && (idy < ds)) {
    float temp = 0;
    for (int i = 0; i < ds / block_size; i++) {

      // Load data into shared memory. HAS store bank conflicts!
      As[threadIdx.y][threadIdx.x] =
          A[idy * ds + (i * block_size + threadIdx.x)];
      Bs[threadIdx.y][threadIdx.x] =
          B[(i * block_size + threadIdx.y) * ds + idx];

      __syncthreads();

      for (int k = 0; k < block_size; k++)
        // Keep track of the running sum. NO load bank conflicts!!!
        temp += As[threadIdx.y][k] *
                Bs[k][threadIdx.x]; // dot product of row and column

      __syncthreads();
    }

    // Write to global memory
    C[idy * ds + idx] = temp;
  }
}

threadIdx.x ranges over [0, 32), so each warp writes 32 consecutive floats and the shared memory stores should be conflict free. But ncu reports a large number of store bank conflicts:

mmul(const float *, const float *, float *, int) (256, 256, 1)x(32, 32, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics

Metric Name Metric Unit Metric Value

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 252249457

It’s also strange that the loads, which index the same arrays in a similar way, report no conflicts.

Furthermore, the same kernel with block_size = 16 reports no store or load bank conflicts:

mmul(const float *, const float *, float *, int) (512, 512, 1)x(16, 16, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics

Metric Name Metric Unit Metric Value

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 0

Can anyone help explain the root cause of the bank conflicts for 32*32 tiling? Thanks!

Robert_Crovella · May 17, 2023, 7:45pm

Shared memory bank conflicts are difficult to capture and interpret correctly using Nsight Compute. Some relevant comments are here.
