If NVINT8 exists, the performance is …

While studying the FP4 formats supported on recent NVIDIA GPUs (MXFP4 and NVFP4),
I started wondering:

What if the same quantization logic were applied not just to FP4, but to INT4 and INT8?
Would that make integer quantization even better?

So, I decided to try.

Code (CPU-based simulation, not TensorRT):

All quantization and evaluation were implemented in pure Python / NumPy.
No TensorRT or GPU quantization kernels were used — this is a CPU reference simulation to study numerical behavior.


Experiment Overview

I implemented MXINT4, NVINT4, MXINT8, NVINT8 based on the quantization behavior of MXFP4/NVFP4.
Then, I compared them against FP32 baselines using synthetic distributions representative of typical model activations (Transformer-like and Reasoning-like).
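
For readers who want something concrete, here is a minimal sketch of what "MX-style" and "NV-style" integer quantization could look like in NumPy. It is not the code used for the results in this post. It assumes the block layout of the FP4 formats: 32-element blocks with a power-of-two scale for MX, and 16-element blocks with an FP8 (E4M3) scale plus one FP32 per-tensor scale for NV. The function names (`quant_mx_int`, `quant_nv_int`, `round_up_to_e4m3`) are hypothetical, scales are rounded up so the block maximum is not clipped (a design choice of this sketch), and the E4M3 rounding is only approximate.

```python
import numpy as np

def _to_blocks(x, block):
    """Flatten, zero-pad to a multiple of `block`, reshape to (n_blocks, block)."""
    flat = np.asarray(x, dtype=np.float32).ravel()
    pad = (-flat.size) % block
    return np.pad(flat, (0, pad)).reshape(-1, block), flat.size

def round_up_to_e4m3(s):
    """Approximate E4M3 rounding: keep 4 significant bits, round up, clamp range.

    Simplification: the coarser spacing of E4M3 subnormals is not modelled.
    """
    s = np.clip(s, 2.0 ** -9, 448.0)   # smallest subnormal .. largest finite E4M3
    m, e = np.frexp(s)                 # s = m * 2**e with m in [0.5, 1)
    return np.ldexp(np.ceil(m * 16.0) / 16.0, e)

def quant_mx_int(x, bits=8, block=32):
    """MX-style fake quantization: signed ints with a power-of-two scale per block."""
    xb, n = _to_blocks(x, block)
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(xb).max(axis=1, keepdims=True)
    tiny = np.finfo(np.float32).tiny   # avoids log2(0) for all-zero blocks
    # Assumption: exponent rounded up so that amax / scale <= qmax.
    exp = np.clip(np.ceil(np.log2(np.maximum(amax, tiny) / qmax)), -127, 127)
    scale = np.exp2(exp)
    q = np.clip(np.round(xb / scale), -qmax, qmax)
    return (q * scale).ravel()[:n]     # dequantized values, same length as input

def quant_nv_int(x, bits=8, block=16):
    """NV-style fake quantization: E4M3-like block scale times an FP32 tensor scale."""
    xb, n = _to_blocks(x, block)
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(xb).max(axis=1, keepdims=True)
    tensor_amax = float(amax.max())
    # The per-tensor scale maps the largest block scale to the top of the E4M3 range.
    tensor_scale = tensor_amax / (qmax * 448.0) if tensor_amax > 0 else 1.0
    block_scale = round_up_to_e4m3(amax / (qmax * tensor_scale))
    scale = block_scale * tensor_scale
    q = np.clip(np.round(xb / scale), -qmax, qmax)
    return (q * scale).ravel()[:n]
```

Both functions return the dequantized array, so the reconstruction error is simply `np.mean((x - quant_nv_int(x, bits=8)) ** 2)`.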

Test setup:

  • Base precision: FP32 (main baseline).
  • Distributions: FP32_FULL (= RangeProfile.FP32_IEEE; log-uniform sampled n=200,000, include_subnormals=False, max_abs_cap≈4.61e18), Transformer-like, Reasoning-like.
  • Formats: MXFP4 / NVFP4 / MXINT4 / NVINT4 / MXINT8 / NVINT8
  • Loops: 5 times (averaged). Throughput is measured in ms/loop on CPU (NumPy reference; not hardware-optimized).
  • Metrics:
    • Reconstruction error (MSE vs FP32)
    • Quantize / Dequant throughput (ms/loop)
    • Bits per value (and Total bits)
    • Nonlinear layer fidelity (GELU, LayerNorm, Softmax):
      • Softmax MSE, Softmax KL(p‖q), Softmax Top-1 delta rate (argmax change vs FP32)
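
The metrics above can be sketched as follows. This is an illustration, not the evaluation code behind the tables: the helper names (`mse`, `softmax_metrics`) are hypothetical, and it assumes the logits are arranged as a 2-D array of shape (rows, classes), which the post does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mse(ref, test):
    """Mean squared error, accumulated in float64."""
    ref = np.asarray(ref, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    return float(np.mean((ref - test) ** 2))

def softmax_metrics(ref_logits, test_logits, eps=1e-12):
    """Compare softmax outputs of reference (FP32) and quantized logits.

    Assumes shape (rows, classes); `eps` guards against log(0).
    """
    p = softmax(np.asarray(ref_logits, dtype=np.float64))
    q = softmax(np.asarray(test_logits, dtype=np.float64))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)   # KL(p || q) per row
    changed = np.argmax(p, axis=-1) != np.argmax(q, axis=-1)        # argmax flipped?
    return {
        "softmax_mse": mse(p, q),
        "softmax_kl": float(kl.mean()),
        "top1_delta_rate": float(changed.mean()),
    }
```

Here `top1_delta_rate` is the fraction of rows whose argmax differs from the FP32 reference, so a value of 0.02 means about 2% of rows change their top-1 choice.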

Key Results

FP32_FULL (IEEE754 FP32 range, log-uniform sampled n=200,000)

| Scheme | MSE vs FP32  | bits/value | Total bits | Quantize ms/loop | Dequant ms/loop |
|--------|--------------|------------|------------|------------------|-----------------|
| MXFP4  | 1.013488e+33 | 4.25000    | 850000     | 3355.222306      | 128.279116      |
| NVFP4  | 9.001336e+32 | 4.50016    | 900032     | 569.166559       | 179.929585      |
| MXINT4 | 4.384233e+34 | 4.25000    | 850000     | 110.942443       | 47.343677       |
| NVINT4 | 8.583500e+32 | 4.50016    | 900032     | 281.392532       | 97.133989       |
| MXINT8 | 1.728744e+32 | 8.25000    | 1650000    | 110.809907       | 48.246607       |
| NVINT8 | 4.610962e+30 | 8.50016    | 1700032    | 285.039266       | 97.157737       |

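NVINT8 achieved the smallest reconstruction error (about 4.6e30):
roughly two orders of magnitude below NVFP4 and NVINT4 (about 9e32) and nearly four below MXINT4 (about 4.4e34).
Its quantize/dequantize times (about 285 and 97 ms/loop) match NVINT4, though it is slower than the MX integer variants (about 111 and 48 ms/loop) and stores about 8.5 bits per value rather than 4.5.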
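
As a sanity check on the storage columns, the bits/value and Total bits figures in the table above are reproduced exactly by the following accounting. The assumptions are inferred from the numbers rather than stated in the post: one 8-bit scale per 32-element block for the MX variants, and one 8-bit scale per 16-element block plus a single 32-bit per-tensor scale for the NV variants. The helper name `storage_bits` is hypothetical.

```python
def storage_bits(n, elem_bits, block, scale_bits=8, tensor_scale_bits=0):
    """Return (total_bits, bits_per_value) for a block-scaled format."""
    n_blocks = -(-n // block)          # ceiling division
    total = n * elem_bits + n_blocks * scale_bits + tensor_scale_bits
    return total, total / n

n = 200_000
mx4 = storage_bits(n, elem_bits=4, block=32)                         # MXFP4 / MXINT4
nv4 = storage_bits(n, elem_bits=4, block=16, tensor_scale_bits=32)   # NVFP4 / NVINT4
mx8 = storage_bits(n, elem_bits=8, block=32)                         # MXINT8
nv8 = storage_bits(n, elem_bits=8, block=16, tensor_scale_bits=32)   # NVINT8

for name, (total, per_value) in [("MX 4-bit", mx4), ("NV 4-bit", nv4),
                                 ("MX 8-bit", mx8), ("NV 8-bit", nv8)]:
    print(f"{name}: total={total} bits, {per_value:.5f} bits/value")
```

The extra 0.00016 bits/value on the NV rows is the 32-bit tensor scale spread over 200,000 values, so it vanishes for larger tensors.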


Transformer-like (synthetic activations from GELU, LayerNorm, Softmax, n=200,000)

| Scheme | GELU MSE | LayerNorm MSE | Softmax MSE | Softmax KL(p‖q) | Softmax Top-1 Δ rate |
|--------|----------|---------------|-------------|-----------------|----------------------|
| MXFP4  | 0.015742 | 0.014989      | 0.000157    | 0.052713        | 0.224712             |
| NVFP4  | 0.008770 | 0.008159      | 0.000025    | 0.013377        | 0.099872             |
| MXINT4 | 0.023316 | 0.021552      | 0.000106    | 0.050271        | 0.199744             |
| NVINT4 | 0.010732 | 0.011094      | 0.000012    | 0.008405        | 0.065301             |
| MXINT8 | 0.006313 | 0.004819      | 0.000100    | 0.042190        | 0.119078             |
| NVINT8 | 0.000210 | 0.000139      | 0.000003    | 0.001137        | 0.021767             |

→ On Transformer-like distributions, NVINT8 tracks FP32 most closely of the six formats:
GELU and LayerNorm MSE are about 2e-4 and 1.4e-4 and Softmax KL is about 1.1e-3, although the Softmax argmax still changes in about 2.2% of cases.


Reasoning-like (heavy-tailed activations, block-scale jitter, n=200,000)

| Scheme | GELU MSE | LayerNorm MSE | Softmax MSE | Softmax KL(p‖q) | Softmax Top-1 Δ rate |
|--------|----------|---------------|-------------|-----------------|----------------------|
| MXFP4  | 0.674760 | 0.022027      | 0.001473    | 0.324865        | 0.289373             |
| NVFP4  | 0.217546 | 0.008232      | 0.000399    | 0.094912        | 0.096671             |
| MXINT4 | 0.725123 | 0.025214      | 0.000848    | 0.241636        | 0.165173             |
| NVINT4 | 0.335547 | 0.013386      | 0.000163    | 0.035941        | 0.064661             |
| MXINT8 | 0.374638 | 0.007906      | 0.000813    | 0.192708        | 0.132522             |
| NVINT8 | 0.005198 | 0.000145      | 0.000025    | 0.005392        | 0.018566             |

→ In reasoning-like (dynamic, scale-varying) contexts,
NVINT8 again shows the smallest deviation from FP32 across GELU, normalization, and Softmax layers
(GELU MSE about 0.0052, Softmax KL about 0.0054), with the Softmax argmax changing in about 1.9% of cases.


Thoughts

It’s just a hypothesis —
maybe my implementation has mistakes, or the logic is flawed somewhere.

If NVINT8 truly existed, wouldn’t that be amazing?
FP32-like accuracy, INT8 efficiency — and almost no quantization loss across GELU, LayerNorm, and Softmax.


Notes

  • Implemented in Python / NumPy, following MXFP4 and NVFP4 scaling behavior.
  • All tests simulated on CPU with n=200,000 samples per distribution (FP32_FULL is sampled log-uniformly over the FP32 range).

Discussion welcome

This is not a definitive result — just an experiment and a thought exercise.
I’d really appreciate any feedback, corrections, or insights from others who’ve worked with INT quantization kernels or TensorRT calibration.

Maybe there’s a bug.
Maybe it’s an accidental truth.
Either way — I’d love to hear your thoughts.


Summary

  • NVFP4/NVINT4 already show better numerical stability than MX variants.
  • NVINT8 (hypothetical) achieves the lowest reconstruction error of the six formats and the closest Softmax behavior to FP32 (Top-1 change rate about 2%).
  • If real, it could be ideal for reasoning-heavy or LLM workloads.

Hi @naisy ,

I am currently not sure whether this is within TensorRT scope; however, I will keep this thread open for the community to respond.

Thank you.

Hi, @AakankshaS ,

Thanks for keeping the thread open!
I understand this topic might not be exactly within the TensorRT scope,
but I wasn’t sure which category would fit best —
TensorRT seemed the most technically relevant place for this kind of quantization discussion.
Thank you.