Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training

Originally published at: https://developer.nvidia.com/blog/per-tensor-and-per-block-scaling-strategies-for-effective-fp8-training/

In this blog post, we’ll break down the main FP8 scaling strategies—per-tensor scaling, delayed and current scaling, and per-block scaling (including the Blackwell-backed MXFP8 format)—and explain why each is essential for maintaining numerical stability and accuracy during low-precision training. Understanding these approaches will help you choose the right recipe for your own FP8 workflows. This…
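To make the per-tensor, current-scaling idea concrete, here is a minimal NumPy sketch: the scale is derived from the tensor's absolute maximum (amax) measured right now, the tensor is scaled into the FP8 range, and the scale is kept so values can be dequantized later. This is an illustrative simulation only; `FP8_E4M3_MAX` and the function names are assumptions, and a real implementation would cast to an actual FP8 type instead of clipping in float32.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

def per_tensor_quantize(x: np.ndarray):
    # Current scaling: compute amax from this tensor's values right now,
    # rather than from a history of past amaxes (delayed scaling).
    amax = float(np.max(np.abs(x)))
    scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
    # Map the tensor into the FP8 dynamic range; a real kernel would
    # cast to FP8 here instead of merely clipping in float32.
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_scaled.astype(np.float32), scale

def dequantize(x_scaled: np.ndarray, scale: float) -> np.ndarray:
    # Undo the scaling to recover values in the original range.
    return x_scaled / scale

# Usage: quantize a tensor whose values exceed the FP8 range.
x = (np.random.randn(4, 4) * 1000.0).astype(np.float32)
x_q, s = per_tensor_quantize(x)
x_back = dequantize(x_q, s)
```

Because one scale covers the whole tensor, a single outlier drags the scale down for every element; per-block schemes such as MXFP8 address this by assigning a scale to each small block of values instead.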