Warp stall reduced but performance not improved

I’m trying to optimize an sgemm algorithm. I have two versions of the code (gemm_v4 and gemm_v5). In gemm_v5, I increased the thread tile size to reduce the number of shared memory loads.
I profiled both versions with Nsight Compute and found that the warp stalls (especially the short scoreboard stall) in gemm_v5 are indeed reduced. However, gemm_v5 performs worse than gemm_v4.
I have attached the code and the Nsight Compute profiling results below. Can anyone help analyze them and give me some advice? Thanks!
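
For context, the reasoning behind the larger thread tile can be sketched as follows. This is only a rough model. It assumes a standard register-tiled inner loop in which each thread, per k-step, loads TM values of A and TN values of B from shared memory and performs TM x TN FMAs. The function names and the 4x4 and 8x8 tile sizes are illustrative assumptions, not necessarily what gemm_v4.cu and gemm_v5.cu use.

```cpp
#include <cassert>

// Hypothetical helper: shared-memory loads issued per FMA for a TM x TN
// thread tile, assuming one k-step loads TM values of A and TN values of B
// and then performs TM * TN FMAs (outer-product accumulation).
double smem_loads_per_fma(int tm, int tn) {
    assert(tm > 0 && tn > 0);
    return static_cast<double>(tm + tn) / (static_cast<double>(tm) * tn);
}

// Hypothetical helper: number of accumulator values each thread keeps
// for its TM x TN output tile (one per output element).
int accumulators_per_thread(int tm, int tn) {
    assert(tm > 0 && tn > 0);
    return tm * tn;
}
```

Under these assumptions, going from a 4x4 to an 8x8 thread tile halves the shared-memory loads per FMA (0.5 to 0.25), while the number of accumulators each thread holds grows from 16 to 64.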

gemm_v4.cu (3.7 KB)
gemm_v5.cu (4.0 KB)
gemm_v4.nsight-cuprof-report (1.8 MB)
gemm_v5.nsight-cuprof-report (2.6 MB)

Some configurations:
Input size of the GEMM: [1024, 1024] x [1024, 1024]
Hardware: GTX 1060 GPU
GPU Device Info.txt (29.8 KB)
CUDA version: CUDA 10.0
Nsight Compute version: v1.0