As a beginner learning CUDA/CuTe, I wrote a naive GEMM program with CuTe (NVIDIA RTX 3060 / CC 8.6) and profiled it with Nsight Compute.
The test source is located in my GitHub repo: < cuda_perf/src/cute_gemm/gemm_tile_naive_cute.cu at master · HPC02/cuda_perf>
1. Program
The key points are listed below:
Sizes: M = 1024, N = 1024, K = 1024 * 8, BM = BN = 64, BK = 16, TM = TN = 8.
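These map to compile-time constants along these lines (a sketch; the actual constant names in the repo may differ):
constexpr int M  = 1024, N = 1024, K = 1024 * 8;
constexpr int BM = 64,   BN = 64,  BK = 16;   // thread-block tile
constexpr int TM = 8,    TN = 8;              // per-thread tile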
Global Matrix:
- A(M, K) : (1, M) K-Major
- B(N, K) : (1, N) K-Major
- C(M, N) : (1, M)
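In CuTe terms, these correspond to global tensors built roughly as follows (a sketch assuming float data and device pointers dA, dB, dC):
Tensor mA = make_tensor(make_gmem_ptr(dA), make_shape(M, K), make_stride(Int<1>{}, M)); // (M, K) : (1, M)
Tensor mB = make_tensor(make_gmem_ptr(dB), make_shape(N, K), make_stride(Int<1>{}, N)); // (N, K) : (1, N)
Tensor mC = make_tensor(make_gmem_ptr(dC), make_shape(M, N), make_stride(Int<1>{}, M)); // (M, N) : (1, M)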
Dimension setup:
- gridDim(N / BN, M / BM)
- blockDim(BN / TN, BM / TM)
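That is, the launch configuration looks like this (the kernel name is a placeholder taken from the file name):
dim3 grid(N / BN, M / BM);     // 16 x 16 thread blocks, one per (BM, BN) tile of C
dim3 block(BN / TN, BM / TM);  // 8 x 8 = 64 threads, each computing a (TM, TN) sub-tile
gemm_tile_naive_cute<<<grid, block>>>(dA, dB, dC, M, N, K);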
Shared memory for thread block:
- sA(BM, BK) : (1, BM)
- sB(BN, BK) : (1, BN)
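The shared-memory tiles can be declared roughly as below (a sketch; the actual repo may allocate them differently):
__shared__ float smemA[BM * BK];
__shared__ float smemB[BN * BK];
Tensor sA = make_tensor(make_smem_ptr(smemA), make_shape(Int<BM>{}, Int<BK>{}), make_stride(Int<1>{}, Int<BM>{})); // (BM, BK) : (1, BM)
Tensor sB = make_tensor(make_smem_ptr(smemB), make_shape(Int<BN>{}, Int<BK>{}), make_stride(Int<1>{}, Int<BN>{})); // (BN, BK) : (1, BN)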
Matrix slices used by each thread block:
Tensor gA = local_tile(mA, make_tile(Int<BM>{}, Int<BK>{}), make_coord(blockIdx.y, _)); // (BM, BK, k)
Tensor gB = local_tile(mB, make_tile(Int<BN>{}, Int<BK>{}), make_coord(blockIdx.x, _)); // (BN, BK, k)
Tensor gC = local_tile(mC, make_tile(Int<BM>{}, Int<BN>{}), make_coord(blockIdx.y, blockIdx.x)); // (BM, BN)
Calculated thread-block slices:
- gA(64, 16, 512), where 512 = K / BK = 1024 * 8 / 16
- gB(64, 16, 512), where 512 = K / BK = 1024 * 8 / 16
- gC(64, 64)
slice-k sub-tile:
Layout tC = make_layout(make_shape(Int<BM / TM>{}, Int<BN / TN>{}));
// local_partition: distribute the work across threads according to the thread layout
// Step<_1, X> means mode 0 participates in the partitioning and mode 1 does not
Tensor tCsA = local_partition(sA, tC, tid, Step<_1, X>{}); // (TM, BK)
Tensor tCsB = local_partition(sB, tC, tid, Step<X, _1>{}); // (TN, BK)
Tensor tCgC = local_partition(gC, tC, tid, Step<_1, _1>{}); // (TM, TN)
Calculated sub-tiles for each thread in the thread block:
tCsA shape: (_8,_16), stride: (_8,_64)
tCsB shape: (_8,_16), stride: (_8,_64)
tCgC shape: (_8,_8), stride: (_8,8192)
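The main loop below also uses tAgA/tAsA, tBgB/tBsB and tCrC, which are not shown above. A plausible construction in the style of the CuTe tutorial (the (32, 2) copy thread layout is my assumption; the actual source may choose differently):
Layout tA = make_layout(make_shape(Int<32>{}, Int<2>{})); // 64 threads for the gmem->smem copies (assumed layout)
Tensor tAgA = local_partition(gA, tA, tid); // (2, 8, k) slice of gA owned by this thread
Tensor tAsA = local_partition(sA, tA, tid); // matching (2, 8) slice of the smem tile sA
Tensor tBgB = local_partition(gB, tA, tid); // (2, 8, k)
Tensor tBsB = local_partition(sB, tA, tid); // (2, 8)
Tensor tCrC = make_tensor_like(tCgC); // (TM, TN) per-thread accumulator in registers
clear(tCrC); // zero the accumulator before the K loop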
slice-k GEMM:
// Iterate over the K tiles
const int num_tile_k = K / BK;
for (int k = 0; k < num_tile_k; k++) {
    // Copy this K tile from global memory to shared memory
    copy(tAgA(_, _, k), tAsA);
    copy(tBgB(_, _, k), tBsB);
    __syncthreads(); // wait until all threads have finished the copy
    // Use cute::gemm to perform the matrix multiply
    // gemm expects A(M,K), B(N,K), C(M,N) - note that B is in (N,K) form
    gemm(tCsA, tCsB, tCrC); // tCrC += tCsA * tCsB^T
    __syncthreads(); // wait until all threads have finished computing before the smem is overwritten
}
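After the K loop, the per-thread accumulator is written back to the C tile; the epilogue is essentially the line below (the actual source may use axpby with alpha/beta instead of a plain copy):
copy(tCrC, tCgC); // write the (TM, TN) register accumulator back to global C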
2. Profile
ncu’s Memory Workload section:
Warp State Statistics:
Source view for L1 Conflicts Shared N-Way:
3. Questions
From my own calculation it seems there should be no bank conflicts in this kernel. How should I understand the Shared Load bank conflicts = 2264 reported in the Memory Workload section?
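For context, the kind of check I mean is the following address/bank calculation (assuming 32 banks, 4-byte floats and element-wise loads; from the layouts above, tCsA(i, k) sits at float offset (tid % 8) + 8*i + 64*k within sA):
#include <cstdio>
int main() {
    const int i = 0, k = 0; // any fixed (i, k) of one inner gemm step
    for (int lane = 0; lane < 32; ++lane) {
        int offset = (lane % 8) + 8 * i + 64 * k; // float offset within sA
        printf("lane %2d -> bank %2d\n", lane, offset % 32);
    }
    // Only 8 distinct banks are touched; lanes that share an address are broadcasts,
    // not conflicts, which is why I expected Bank Conflicts = 0 for the shared loads.
    return 0;
}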
My second question: does the Stall MIO Throttle come from GMEM or from SMEM?
My last question: how should I interpret the L1 Conflicts Shared N-Way metric in the Source view?


