Some say mixed-precision stability is solved — others still hit NaNs daily. What’s your reality?
With FP8 adoption accelerating and H100s everywhere, I’m seeing conflicting reports on the stability of mixed-precision training.
I’m researching training stability challenges (and solutions) in FP16 and FP8 workloads — especially cases involving gradient underflow, NaNs, or divergence.
I’d love to hear both sides:
- What’s working perfectly “out of the box” for you?
- Have you hit stability issues? If so, what was the fix (or the breaking point)?
- Roughly how much compute time or money did instability cost you?
The goal is to collect real-world configurations, tips, and pitfalls so others can avoid wasted runs and get to stable results faster. I’ll compile anonymized results into a public summary so we can all benefit.