Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training

Originally published at: https://developer.nvidia.com/blog/per-tensor-and-per-block-scaling-strategies-for-effective-fp8-training/

In this blog post, we’ll break down the main FP8 scaling strategies—per-tensor scaling, delayed and current scaling, and per-block scaling (including the Blackwell-backed MXFP8 format)—and explain why each is essential for maintaining numerical stability and accuracy during low-precision training. Understanding these approaches will help you choose the right recipe for your own FP8 workflows. This…
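To make the per-tensor, current-scaling idea concrete, here is a minimal NumPy sketch: the scale is derived from the tensor's absolute maximum (amax) measured right now, the tensor is scaled into the FP8 range, and the scale is kept so values can be dequantized later. This is an illustrative simulation only; `FP8_E4M3_MAX` and the function names are assumptions, and a real implementation would cast to an actual FP8 type instead of clipping in float32.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def per_tensor_quantize(x: np.ndarray):
    # Current scaling: compute amax from this tensor's values right now,
    # rather than from a history of past amaxes (delayed scaling).
    amax = float(np.max(np.abs(x)))
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    # Map the tensor into the FP8 dynamic range; a real kernel would
    # cast to FP8 here instead of merely clipping in float32.
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_scaled.astype(np.float32), scale

def dequantize(x_scaled: np.ndarray, scale: float) -> np.ndarray:
    # Undo the scaling to recover values in the original range.
    return x_scaled / scale

# Usage: quantize a tensor whose values exceed the FP8 range.
x = (np.random.randn(4, 4) * 1000.0).astype(np.float32)
x_q, s = per_tensor_quantize(x)
x_back = dequantize(x_q, s)
```

Because one scale covers the whole tensor, a single outlier drags the scale down for every element; per-block schemes such as MXFP8 address this by assigning a scale to each small block of values instead.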