Help with analyzing and calculating bank conflicts of a naive GEMM implementation using CuTe

As a beginner learning CUDA/CuTe, I created a naive GEMM program using CuTe on an NVIDIA RTX 3060 (CC 8.6) and profiled it with Nsight Compute.

The test source is located in my GitHub repo: <cuda_perf/src/cute_gemm/gemm_tile_naive_cute.cu at master · HPC02/cuda_perf>

1. Program

The key points are listed below:

Sizes: M = 1024, N = 1024, K = 1024 * 8; BM = BN = 64, BK = 16, TM = TN = 8.

Global matrices:

  • A(M, K) : (1, M), M-mode contiguous (MN-major)
  • B(N, K) : (1, N), N-mode contiguous (MN-major)
  • C(M, N) : (1, M), M-mode contiguous

Dimensions:

  • gridDim(N / BN, M / BM)
  • blockDim(BN / TN, BM / TM)

Shared memory for thread block:

  • sA(BM, BK) : (1, BM)
  • sB(BN, BK) : (1, BN)

Matrix slices used by a thread block:

Tensor gA = local_tile(mA, make_tile(Int<BM>{}, Int<BK>{}), make_coord(blockIdx.y, _));           // (BM, BK, k)
Tensor gB = local_tile(mB, make_tile(Int<BN>{}, Int<BK>{}), make_coord(blockIdx.x, _));           // (BN, BK, k)
Tensor gC = local_tile(mC, make_tile(Int<BM>{}, Int<BN>{}), make_coord(blockIdx.y, blockIdx.x));  // (BM, BN)

Calculated thread-block slices:

  • gA(64, 16, 512), k-tiles = K / BK = 8192 / 16 = 512
  • gB(64, 16, 512), k-tiles = K / BK = 8192 / 16 = 512
  • gC(64, 64)

slice-k sub-tile:

Layout tC = make_layout(make_shape(Int<BM / TM>{}, Int<BN / TN>{}));

// local_partition: distribute work according to the thread layout
// Step<_1, X> means mode 0 participates in the partition and mode 1 does not
Tensor tCsA = local_partition(sA, tC, tid, Step<_1, X>{});   // (TM, BK)
Tensor tCsB = local_partition(sB, tC, tid, Step<X, _1>{});   // (TN, BK)
Tensor tCgC = local_partition(gC, tC, tid, Step<_1, _1>{});  // (TM, TN)

Calculated sub-tiles for each thread in the thread block:

tCsA shape: (_8,_16), stride: (_8,_64)
tCsB shape: (_8,_16), stride: (_8,_64)
tCgC shape: (_8,_8), stride: (_8,8192)

slice-k GEMM:

// Loop over the K dimension
const int num_tile_k = K / BK;
for (int k = 0; k < num_tile_k; k++) {
  // Copy from global memory to shared memory
  copy(tAgA(_, _, k), tAsA);
  copy(tBgB(_, _, k), tBsB);

  __syncthreads();  // wait for all threads to finish the copy

  // Use cute::gemm to perform the matrix multiply
  // gemm expects A(M,K), B(N,K), C(M,N); B is in (N,K) form
  gemm(tCsA, tCsB, tCrC);  // tCrC += tCsA * tCsB^T

  __syncthreads();  // wait for all threads to finish the compute
}

2. Profile

Nsight Compute's Memory Workload section:

Warp State Statistics:

Source view for L1 Conflicts Shared N-Way:

3. Question

After working through the access patterns by hand, I find no bank conflicts in the kernel function. How should I understand the Shared Load Matrix bank conflicts = 2264 reported in the Memory Workload section?

The second question: does the Stall MIO Throttle come from GMEM or from SMEM?

The last question: how should I understand L1 Conflicts Shared N-Way in the source view?