# Bank conflict of tiled matrix multiplication

The 32*32 tiled matrix multiplication kernel has store bank conflicts which couldn’t be explained:

``````constexpr int DSIZE = 8192;
constexpr int block_size = 32;

// matrix multiply kernel: C = A * B
__global__ void mmul(const float *A, const float *B, float *C, int ds) {

// declare cache in shared memory
__shared__ float As[block_size][block_size];
__shared__ float Bs[block_size][block_size];

int idx = threadIdx.x + blockDim.x * blockIdx.x; // create thread x index
int idy = threadIdx.y + blockDim.y * blockIdx.y; // create thread y index

if ((idx < ds) && (idy < ds)) {
float temp = 0;
for (int i = 0; i < ds / block_size; i++) {

// Load data into shared memory. HAS store bank conflicts!
A[idy * ds + (i * block_size + threadIdx.x)];
B[(i * block_size + threadIdx.y) * ds + idx];

for (int k = 0; k < block_size; k++)
// Keep track of the running sum. NO load bank conflicts!!!
Bs[k][threadIdx.x]; // dot product of row and column

}

// Write to global memory
C[idy * ds + idx] = temp;
}
}
``````

The value of threadIdx.x was in [0, 32) so the shared memory store ops seemed to be conflict free. But ncu said there were a lot of bank conflicts:

mmul(const float *, const float *, float *, int) (256, 256, 1)x(32, 32, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics

Metric Name Metric Unit Metric Value

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 252249457

Furthermore, the same kernel with `block_size = 16` has no store/load bank conflicts:

mmul(const float *, const float *, float *, int) (512, 512, 1)x(16, 16, 1), Context 1, Stream 7, Device 0, CC 8.9
Section: Command line profiler metrics

Metric Name Metric Unit Metric Value

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 0

Can anyone help explain what’s the root cause of bank conflicts for 32*32 tiling? Thanks!

shared memory bank conflicts are a difficult thing to capture and understand correctly using nsight compute. Some relevant comments are here.

1 Like