Warp stall reduced but performance not improved

1921296014 · October 23, 2022, 8:15am

Hi,
I’m trying to optimize an SGEMM kernel. I have two versions of the code (gemm_v4 and gemm_v5). In gemm_v5, I increased the thread tile size to reduce shared memory loads.
I profiled both versions with Nsight Compute and found that the warp stalls (especially the short scoreboard stall) are indeed reduced in gemm_v5. However, gemm_v5 performs worse than gemm_v4.
I have attached the code and the Nsight Compute profiling results below. Can anyone help analyze them and give me some advice? Thanks!

gemm_v4.cu (3.7 KB)
gemm_v5.cu (4.0 KB)
gemm_v4.nsight-cuprof-report (1.8 MB)
gemm_v5.nsight-cuprof-report (2.6 MB)

Some configuration details:
GEMM input size: [1024, 1024] x [1024, 1024]
Hardware: GTX 1060 GPU
GPU Device Info.txt (29.8 KB)
CUDA version: CUDA 10.0
Nsight Compute version: v1.0
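
To make the thread-tile idea above concrete, here is a rough sketch of the arithmetic behind it. It is not taken from gemm_v4.cu or gemm_v5.cu: the helper names `tile_cost` and `loads_per_fma` are hypothetical, and it assumes the usual register-tiled inner loop in which each thread reads TM values of A and TN values of B from shared memory per k-step and performs TM x TN fused multiply-adds.

```cpp
// Hypothetical model of one k-step of a register-tiled SGEMM inner loop.
// Assumption: a thread owning a TM x TN tile of C reads TM values of A and
// TN values of B from shared memory, then does TM * TN FMAs.
struct TileCost {
    int shared_loads;  // values read from shared memory per k-step
    int fmas;          // fused multiply-adds per k-step
    int accumulators;  // registers needed just to hold the C sub-tile
};

// Cost of one k-step for a TM x TN thread tile.
inline TileCost tile_cost(int tm, int tn) {
    TileCost c;
    c.shared_loads = tm + tn;
    c.fmas = tm * tn;
    c.accumulators = tm * tn;
    return c;
}

// Shared-memory loads issued per FMA; lower means less shared memory traffic.
inline double loads_per_fma(int tm, int tn) {
    return static_cast<double>(tm + tn) / static_cast<double>(tm * tn);
}
```

With assumed tile sizes of 4x4 and 8x8, `tile_cost(4, 4)` gives 8 shared loads for 16 FMAs, while `tile_cost(8, 8)` gives 16 shared loads for 64 FMAs, so the loads-per-FMA ratio drops from 0.5 to 0.25. The accumulator count grows from 16 to 64 at the same time, so a larger tile also needs more registers per thread. That may lower occupancy, which is one thing worth checking in the profiling reports.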
