Why the performance of tf32 tensor_core is poor?

Greg · August 3, 2023, 5:38pm

The A4000 is a GA104-based GPU with a sustained throughput of 0.5 Load Store Unit (LSU) instructions/cycle and 1 LSU wavefront/cycle. The ampere_sgemm_128x32_nn kernel uses shared memory and global memory very efficiently; however, its performance is limited by the LSU.

The GPU Speed of Light (SOL) SM and SOL Memory metrics have the same value (87.14%). For CC 7.0 - 9.0 the LSU instruction throughput and the LSU request throughput are the same. The former is part of sm__instruction_throughput and the latter is part of gpu__compute_memory_throughput.

sm__inst_executed_pipe_lsu.sum ≈ SharedMemory::Instructions[Total] + L1/TEX Cache::Instructions[Total]
437,551,104 = 353,140,736 + 84,410,368

The report did not retain the value of sm__cycles_elapsed.sum, but it can be approximated as gpc__cycles_elapsed.max x 48 (SM count) = 20,955,367 x 48 = 1,005,857,616

sm__inst_executed_pipe_lsu.avg.peak_sustained = 0.5

437,551,104 / (1,005,857,616 x 0.5) = 87%
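
The arithmetic above can be reproduced with a short script. This is only a sketch: the counter values are copied from this post, and the function name lsu_utilization and the variable names are made up for illustration; they are not part of Nsight Compute.

```python
# Counter values copied from the post (not read from a report file).
shared_mem_instructions = 353_140_736   # SharedMemory::Instructions[Total]
l1tex_instructions = 84_410_368         # L1/TEX Cache::Instructions[Total]
gpc_cycles_elapsed_max = 20_955_367     # gpc__cycles_elapsed.max
sm_count = 48                           # SM count used in the post
peak_lsu_inst_per_cycle = 0.5           # sm__inst_executed_pipe_lsu.avg.peak_sustained


def lsu_utilization(lsu_instructions, cycles, sms, peak_per_cycle):
    """Hypothetical helper: fraction of the sustained LSU instruction rate used.

    sm__cycles_elapsed.sum is approximated as cycles * sms, as in the post.
    """
    sm_cycles_elapsed_sum = cycles * sms
    return lsu_instructions / (sm_cycles_elapsed_sum * peak_per_cycle)


# Approximation of sm__inst_executed_pipe_lsu.sum
lsu_instructions = shared_mem_instructions + l1tex_instructions

utilization = lsu_utilization(
    lsu_instructions, gpc_cycles_elapsed_max, sm_count, peak_lsu_inst_per_cycle
)
print(f"LSU instructions:  {lsu_instructions:,}")
print(f"Approx. SM cycles: {gpc_cycles_elapsed_max * sm_count:,}")
print(f"LSU utilization:   {utilization:.1%}")
```

The result lands at roughly 87%, close to the 87.14% SOL value. The small difference is expected because the cycle count is an approximation.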

CC 7.0 (GV100), CC 8.0 (GA100) and CC 9.0 (GH100) can sustain 1 LSU instruction/cycle/SM.

For FP32 and TF32 the critical issue is shared memory instruction throughput. The kernel is already efficiently using LDS.128 and STS.128 for many of the accesses.
