I’m trying to optimize an SGEMM kernel. I have two versions of the code (gemm_v4 and gemm_v5). In gemm_v5, I increase the per-thread tile size to reduce the number of shared memory loads.
I profiled both versions with Nsight Compute and found that warp stalls (especially short scoreboard stalls) are indeed reduced in gemm_v5. However, gemm_v5 performs worse than gemm_v4.
I attached the code and the Nsight Compute profiling results below. Can anyone help me analyze this and give me some advice? Thanks!
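For context, the thread-tiling idea I mean looks roughly like this. This is only a simplified sketch, not my actual gemm_v5; the tile sizes BK/TM/TN and the 64-wide block tiles are illustrative, and the global-memory loads and epilogue are elided:

```cuda
// Sketch of thread tiling in SGEMM (illustrative, not the real gemm_v5).
// Each thread computes a TM x TN sub-tile of C, so every value read from
// shared memory is reused TM (or TN) times from registers -- that reuse is
// what cuts the shared-memory load count (and short scoreboard stalls).
#define BK 16   // K-dimension depth of a block tile (assumed)
#define TM 4    // rows of C per thread (assumed)
#define TN 4    // cols of C per thread (assumed)

__global__ void sgemm_thread_tile(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
    __shared__ float As[BK][64];   // block tile of A (block M-size 64, assumed)
    __shared__ float Bs[BK][64];   // block tile of B (block N-size 64, assumed)

    float acc[TM][TN] = {{0.0f}};  // per-thread accumulator tile in registers
    float regA[TM], regB[TN];

    for (int kt = 0; kt < K; kt += BK) {
        // ... cooperative load of As/Bs from global memory, __syncthreads() ...
        for (int k = 0; k < BK; ++k) {
            // TM + TN shared-memory loads ...
            for (int i = 0; i < TM; ++i) regA[i] = As[k][threadIdx.y * TM + i];
            for (int j = 0; j < TN; ++j) regB[j] = Bs[k][threadIdx.x * TN + j];
            // ... feed TM * TN FMAs, so loads-per-FMA shrink as TM, TN grow.
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += regA[i] * regB[j];
        }
        __syncthreads();
    }
    // ... write acc back to the thread's TM x TN region of C ...
}
```

My concern is that growing TM/TN also grows register pressure per thread, which can lower occupancy even while stalls per warp go down.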
The input size of the GEMM: [1024, 1024] x [1024, 1024]
The hardware: GTX 1060 GPU
GPU Device Info.txt (29.8 KB)
The CUDA version: CUDA 10.0
The Nsight Compute version: v1.0