As a beginner learning CUDA/CuTe, I wrote a naive GEMM program with CuTe (NVIDIA RTX 3060 / CC 8.6) and profiled it with Nsight Compute.
The test source is located in my GitHub repo: < cuda_perf/src/cute_gemm/gemm_tile_naive_cute.cu at master · HPC02/cuda_perf>
1. Program
The key points are listed below:
Sizes: M = 1024, N = 1024, K = 1024 * 8, BM = BN = 64, BK = 16, TM = TN = 8.
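These map to compile-time constants along these lines (a sketch; the actual constant names in the repo may differ):
constexpr int M  = 1024, N = 1024, K = 1024 * 8;
constexpr int BM = 64,   BN = 64,  BK = 16;   // thread-block tile
constexpr int TM = 8,    TN = 8;              // per-thread tile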
Global Matrix:
- A(M, K) : (1, M) K-Major
- B(N, K) : (1, N) K-Major
- C(M, N) : (1, M)
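In CuTe terms, these correspond to global tensors built roughly as follows (a sketch assuming float data and device pointers dA, dB, dC):
Tensor mA = make_tensor(make_gmem_ptr(dA), make_shape(M, K), make_stride(Int<1>{}, M)); // (M, K) : (1, M)
Tensor mB = make_tensor(make_gmem_ptr(dB), make_shape(N, K), make_stride(Int<1>{}, N)); // (N, K) : (1, N)
Tensor mC = make_tensor(make_gmem_ptr(dC), make_shape(M, N), make_stride(Int<1>{}, M)); // (M, N) : (1, M)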
Dimension setup:
- gridDim(N / BN, M / BM)
- blockDim(BN / TN, BM / TM)
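That is, the launch configuration looks like this (the kernel name is a placeholder taken from the file name):
dim3 grid(N / BN, M / BM);     // 16 x 16 thread blocks, one per (BM, BN) tile of C
dim3 block(BN / TN, BM / TM);  // 8 x 8 = 64 threads, each computing a (TM, TN) sub-tile
gemm_tile_naive_cute<<<grid, block>>>(dA, dB, dC, M, N, K);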
Shared memory for thread block:
- sA(BM, BK) : (1, BM)
- sB(BN, BK) : (1, BN)
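The shared-memory tiles can be declared roughly as below (a sketch; the actual repo may allocate them differently):
__shared__ float smemA[BM * BK];
__shared__ float smemB[BN * BK];
Tensor sA = make_tensor(make_smem_ptr(smemA), make_shape(Int<BM>{}, Int<BK>{}), make_stride(Int<1>{}, Int<BM>{})); // (BM, BK) : (1, BM)
Tensor sB = make_tensor(make_smem_ptr(smemB), make_shape(Int<BN>{}, Int<BK>{}), make_stride(Int<1>{}, Int<BN>{})); // (BN, BK) : (1, BN)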
Matrix slices used by each thread block:
Tensor gA = local_tile(mA, make_tile(Int<BM>{}, Int<BK>{}), make_coord(blockIdx.y, _)); // (BM, BK, k)
Tensor gB = local_tile(mB, make_tile(Int<BN>{}, Int<BK>{}), make_coord(blockIdx.x, _)); // (BN, BK, k)
Tensor gC = local_tile(mC, make_tile(Int<BM>{}, Int<BN>{}), make_coord(blockIdx.y, blockIdx.x)); // (BM, BN)
Calculated thread-block slices:
- gA(64, 16, 512), where 512 = K / BK = 1024 * 8 / 16
- gB(64, 16, 512), where 512 = K / BK = 1024 * 8 / 16
- gC(64, 64)
slice-k sub-tile:
Layout tC = make_layout(make_shape(Int<BM / TM>{}, Int<BN / TN>{}));
// local_partition: distribute the work across threads according to the thread layout
// Step<_1, X> means mode 0 participates in the partitioning and mode 1 does not
Tensor tCsA = local_partition(sA, tC, tid, Step<_1, X>{}); // (TM, BK)
Tensor tCsB = local_partition(sB, tC, tid, Step<X, _1>{}); // (TN, BK)
Tensor tCgC = local_partition(gC, tC, tid, Step<_1, _1>{}); // (TM, TN)
Calculated sub-tiles for each thread in the thread block:
tCsA shape: (_8,_16), stride: (_8,_64)
tCsB shape: (_8,_16), stride: (_8,_64)
tCgC shape: (_8,_8), stride: (_8,8192)
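The main loop below also uses tAgA/tAsA, tBgB/tBsB and tCrC, which are not shown above. A plausible construction in the style of the CuTe tutorial (the (32, 2) copy thread layout is my assumption; the actual source may choose differently):
Layout tA = make_layout(make_shape(Int<32>{}, Int<2>{})); // 64 threads for the gmem->smem copies (assumed layout)
Tensor tAgA = local_partition(gA, tA, tid); // (2, 8, k) slice of gA owned by this thread
Tensor tAsA = local_partition(sA, tA, tid); // matching (2, 8) slice of the smem tile sA
Tensor tBgB = local_partition(gB, tA, tid); // (2, 8, k)
Tensor tBsB = local_partition(sB, tA, tid); // (2, 8)
Tensor tCrC = make_tensor_like(tCgC); // (TM, TN) per-thread accumulator in registers
clear(tCrC); // zero the accumulator before the K loop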
slice-k GEMM:
// Iterate over the K tiles
const int num_tile_k = K / BK;
for (int k = 0; k < num_tile_k; k++) {
    // Copy this K tile from global memory to shared memory
    copy(tAgA(_, _, k), tAsA);
    copy(tBgB(_, _, k), tBsB);
    __syncthreads(); // wait until all threads have finished the copy
    // Use cute::gemm to perform the matrix multiply
    // gemm expects A(M,K), B(N,K), C(M,N) - note that B is in (N,K) form
    gemm(tCsA, tCsB, tCrC); // tCrC += tCsA * tCsB^T
    __syncthreads(); // wait until all threads have finished computing before the smem is overwritten
}
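After the K loop, the per-thread accumulator is written back to the C tile; the epilogue is essentially the line below (the actual source may use axpby with alpha/beta instead of a plain copy):
copy(tCrC, tCgC); // write the (TM, TN) register accumulator back to global C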
2. Profile
ncu’s Memory Workload section:
Warp State Statistics:
Source view for L1 Conflicts Shared N-Way:
3. Questions
From my own calculation it seems there should be no bank conflicts in this kernel. How should I understand the Shared Load bank conflicts = 2264 reported in the Memory Workload section?
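For context, the kind of check I mean is the following address/bank calculation (assuming 32 banks, 4-byte floats and element-wise loads; from the layouts above, tCsA(i, k) sits at float offset (tid % 8) + 8*i + 64*k within sA):
#include <cstdio>
int main() {
    const int i = 0, k = 0; // any fixed (i, k) of one inner gemm step
    for (int lane = 0; lane < 32; ++lane) {
        int offset = (lane % 8) + 8 * i + 64 * k; // float offset within sA
        printf("lane %2d -> bank %2d\n", lane, offset % 32);
    }
    // Only 8 distinct banks are touched; lanes that share an address are broadcasts,
    // not conflicts, which is why I expected Bank Conflicts = 0 for the shared loads.
    return 0;
}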
My second question: does the Stall MIO Throttle come from GMEM or from SMEM?
My last question: how should I interpret the L1 Conflicts Shared N-Way metric in the Source view?


