FP16/FP8 Training Stability: What’s Working and What’s Failing?

Some say mixed-precision stability is solved — others still hit NaNs daily. What’s your reality?

With FP8 adoption accelerating and H100s everywhere, I’m seeing mixed reports on stability in mixed-precision training.

I’m researching training stability challenges (and solutions) in FP16 and FP8 workloads — especially cases involving gradient underflow, NaNs, or divergence.

I’d love to hear both sides:

  • What’s working perfectly “out of the box” for you?

  • Have you hit stability issues — if so, what was the fix (or the breaking point)?

  • Roughly how much compute time or money did instability cost you?

The goal is to collect real-world configurations, tips, and pitfalls so others can avoid wasted runs and get to stable results faster. I’ll compile anonymized results into a public summary so we can all benefit.

To kick things off:
I’ve personally seen FP16 runs stay rock-solid with PyTorch AMP + dynamic loss scaling — but FP8 gave me NaNs in under 1k steps on a transformer model (H100, CUDA 12.x, PyTorch 2.x).

Curious if that’s just me or if others have seen the same.
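
For anyone newer to this, here is a minimal sketch of what I mean by dynamic loss scaling. It is a toy, stdlib-only simulation rather than the PyTorch implementation: `to_fp16` and `DynamicLossScaler` are hypothetical names I made up, FP16 rounding is emulated with Python's `struct` half-precision format, and the default constants are chosen to match what I understand to be the defaults of PyTorch's `GradScaler`.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Finite values beyond the FP16 range (~65504) overflow to infinity.
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical class, not the PyTorch API)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, scaled_grads) -> bool:
        """Return True if the optimizer step should run, False if skipped."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow detected: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


# Underflow: a tiny gradient vanishes in FP16 unless it is scaled first.
tiny = 1e-8
print("unscaled:", to_fp16(tiny))
print("scaled then unscaled:", to_fp16(tiny * 2.0 ** 16) / 2.0 ** 16)

# Overflow: a large gradient forces the scaler to back off and skip steps.
scaler = DynamicLossScaler(growth_interval=3)
for step in range(6):
    grads = [to_fp16(g * scaler.scale) for g in (1e-8, 2.0)]
    applied = scaler.update(grads)
    print(step, "applied" if applied else "skipped", "scale ->", scaler.scale)
```

The point of the toy loop is that the scale settles at the largest value that keeps the biggest gradient finite, which is also what keeps the smallest gradients from flushing to zero. In a real run you would use the framework's scaler rather than anything like this.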

If you’ve had FP8 runs stay stable:

  • What loss scaling, optimizer settings, or precision-preserving tricks made it work?

  • Any specific driver / CUDA / cuDNN combos that helped?

  • How far (in steps/epochs) have you pushed it without divergence?

I’m especially interested in configs that survived >100k steps — and equally in horror stories where things blew up early, so we can spot patterns.
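
To make the FP8 scaling question above more concrete: as I understand it, FP8 recipes typically pick a per-tensor scale from a history of recent absolute maxima (often called delayed scaling), so that values fit the narrow FP8 range. The sketch below is my own simplification, not any library's API. `compute_scale`, `saturated_fraction`, and the `margin` parameter are hypothetical names, and the range constants assume the common E4M3 and E5M2 formats.

```python
import math
from collections import deque

E4M3_MAX = 448.0     # largest finite value in FP8 E4M3 (assumed variant)
E5M2_MAX = 57344.0   # largest finite value in FP8 E5M2


def compute_scale(amax_history, fp8_max=E4M3_MAX, margin=0):
    """Pick a scale so the largest recently seen value maps to fp8_max."""
    finite = [a for a in amax_history if math.isfinite(a)]
    amax = max(finite, default=0.0)
    if amax <= 0.0:
        # No usable signal yet: fall back to a neutral scale.
        return 1.0
    return fp8_max / amax / (2.0 ** margin)


def saturated_fraction(values, scale, fp8_max=E4M3_MAX):
    """Fraction of values that would clip after scaling into FP8 range."""
    clipped = sum(1 for v in values if abs(v * scale) > fp8_max)
    return clipped / len(values)


# The scale is derived from past steps, so it lags behind the current tensor.
history = deque(maxlen=4)
for amax in (1.0, 2.0, 4.0):
    history.append(amax)
scale = compute_scale(history)

calm = [0.5, -3.0, 4.0]     # within the range the history has seen
spike = [0.5, -3.0, 40.0]   # one outlier the history has not seen yet
print("scale:", scale)
print("calm saturation:", saturated_fraction(calm, scale))
print("spike saturation:", saturated_fraction(spike, scale))
```

My guess is that the lag between the history and a sudden activation or gradient spike is one place early FP8 blow-ups could come from, so if you report a config, the scaling recipe, history length, and any margin you used would be very useful details.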