Why is the performance of tf32 tensor_core poor?

An A4000 (GA104-based) GPU has a sustained throughput of 0.5 Load Store Unit (LSU) instructions/cycle and 1 LSU wavefront/cycle per SM. The ampere_sgemm_128x32_nn kernel uses shared memory and global memory very efficiently; however, its performance is limited by the LSU.

The GPU Speed of Light (SOL) SM and SOL Memory throughputs have the same value (87.14%). For CC 7.0 - 9.0 the LSU instruction throughput and the LSU request throughput are the same. The former contributes to sm__instruction_throughput and the latter to gpu__compute_memory_throughput.

sm__inst_executed_pipe_lsu.sum ≈ SharedMemory::Instructions[Total] + L1/TEX Cache::Instructions[Total]
437,551,104 = 353,140,736 + 84,410,368

The report does not contain the value of sm__cycles_elapsed.sum, but it can be approximated as gpc__cycles_elapsed.max × 48 (the SM count) = 20,955,367 × 48 = 1,005,857,616.

sm__inst_executed_pipe_lsu.avg.peak_sustained = 0.5

437,551,104 / (1,005,857,616 × 0.5) ≈ 87%
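
Putting those numbers together, here is a minimal sketch (plain host code, compilable with nvcc or any C++ compiler) of how the ~87% figure falls out of the metric values quoted above. The constants are taken from the report and the GA104 peak rate; only the arithmetic is being illustrated:

```
// Sketch: reproduce the ~87% LSU utilization from the reported metric values.
#include <cstdio>

int main() {
    const double shared_mem_inst = 353140736.0;  // SharedMemory::Instructions[Total]
    const double l1tex_inst      = 84410368.0;   // L1/TEX Cache::Instructions[Total]
    const double lsu_inst        = shared_mem_inst + l1tex_inst;  // ~ sm__inst_executed_pipe_lsu.sum

    const double gpc_cycles_max  = 20955367.0;   // gpc__cycles_elapsed.max
    const int    sm_count        = 48;           // GA104 (A4000)
    const double sm_cycles       = gpc_cycles_max * sm_count;     // ~ sm__cycles_elapsed.sum

    const double peak_lsu_per_cycle = 0.5;       // sm__inst_executed_pipe_lsu.avg.peak_sustained

    const double utilization = lsu_inst / (sm_cycles * peak_lsu_per_cycle);
    printf("LSU utilization: %.2f%%\n", utilization * 100.0);     // prints ~87%
    return 0;
}
```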

CC 7.0 (GV100), CC 8.0 (GA100) and CC 9.0 (GH100) can sustain 1 LSU instruction/cycle/SM.

For FP32 and TF32, the critical issue is shared memory instruction throughput. The kernel already makes efficient use of LDS.128 and STS.128 for many of its accesses.
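
For illustration only (this is not the actual ampere_sgemm_128x32_nn code), here is a minimal CUDA sketch of the kind of 128-bit shared-memory access that compiles to STS.128/LDS.128. The kernel name, tile size, and indexing are made up for the example:

```
// Illustrative only: float4 (16-byte) shared-memory accesses let the compiler
// emit one STS.128/LDS.128 per 4 floats instead of four scalar STS/LDS,
// reducing the number of LSU instructions for those accesses by 4x.
__global__ void copy_through_smem(const float4* __restrict__ in,
                                  float4* __restrict__ out)
{
    __shared__ float4 tile[256];                 // hypothetical tile size

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                 // 128-bit global load, then STS.128
    __syncthreads();

    out[idx] = tile[threadIdx.x];                // LDS.128, then 128-bit global store
}
```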
