The A4000 (GA104) GPU has a sustained throughput of 0.5 Load Store Unit (LSU) instructions/cycle and 1 LSU wavefront/cycle per SM. The ampere_sgemm_128x32_nn kernel uses shared memory and global memory very efficiently; however, its performance is limited by the LSU.
The GPU Speed of Light (SOL) SM and SOL Memory values are the same (87.14%). For CC 7.0 - 9.0, the LSU instruction throughput and the LSU request throughput are the same. The former is part of sm__instruction_throughput and the latter is part of gpu__compute_memory_throughput.
sm__inst_executed_pipe_lsu.sum approx= SharedMemory::Instructions[Total] + L1/TEX Cache::Instructions[Total]
437,551,104 = 353,140,736 + 84,410,368
The report did not retain the value of sm__cycles_elapsed.sum, but it can be approximated as gpc__cycles_elapsed.max x 48 (SM count) = 20,955,367 x 48 = 1,005,857,616
sm__inst_executed_pipe_lsu.avg.peak_sustained = 0.5
437,551,104 / (1,005,857,616 x 0.5) ≈ 87%
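The arithmetic above can be reproduced with a short script. This is only a sketch: the counter values are copied from the numbers quoted in this post, the variable names are made up for illustration, and the SM cycle count is the same gpc__cycles_elapsed.max x SM count approximation, not a measured sm__cycles_elapsed.sum.

```python
# Counter values copied from the report discussed above.
shared_inst = 353_140_736      # SharedMemory::Instructions[Total]
l1tex_inst = 84_410_368        # L1/TEX Cache::Instructions[Total]
gpc_cycles_max = 20_955_367    # gpc__cycles_elapsed.max
sm_count = 48                  # SM count used in the post for the A4000
lsu_peak_per_cycle = 0.5       # sm__inst_executed_pipe_lsu.avg.peak_sustained

# LSU instructions are approximated as shared memory + L1/TEX instructions.
lsu_inst = shared_inst + l1tex_inst

# Assumption: sm__cycles_elapsed.sum ~= gpc__cycles_elapsed.max * SM count.
sm_cycles = gpc_cycles_max * sm_count

# Fraction of the sustained LSU instruction rate that the kernel reaches.
lsu_utilization = lsu_inst / (sm_cycles * lsu_peak_per_cycle)

print(f"LSU instructions:  {lsu_inst:,}")        # 437,551,104
print(f"Approx. SM cycles: {sm_cycles:,}")       # 1,005,857,616
print(f"LSU utilization:   {lsu_utilization:.1%}")  # 87.0%
```

Setting lsu_peak_per_cycle to 1.0 shows what the same instruction count would look like against a part that sustains 1 LSU instruction/cycle/SM, assuming the elapsed cycles stayed the same, which they would not in practice.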
CC 7.0 (GV100), CC 8.0 (GA100) and CC 9.0 (GH100) can sustain 1 LSU instruction/cycle/SM.
For FP32 and TF32 the critical issue is shared memory instruction throughput. The kernel is already efficiently using LDS.128 and STS.128 for many of the accesses.