If NVINT8 exists, the performance is …
While studying the new block-scaled FP4 formats, MXFP4 (from the OCP Microscaling spec) and NVIDIA's NVFP4,
I started wondering:
What if the same quantization logic were applied not just to FP4, but to INT4 and INT8?
Would that make integer quantization even better?
So, I decided to try.
Code (CPU-based simulation, not TensorRT):
All quantization and evaluation were implemented in pure Python / NumPy.
No TensorRT or GPU quantization kernels were used — this is a CPU reference simulation to study numerical behavior.
Experiment Overview
I implemented MXINT4, NVINT4, MXINT8, NVINT8 based on the quantization behavior of MXFP4/NVFP4.
Then, I compared them against an FP32 baseline on three synthetic inputs: a log-uniform sweep of the FP32 range (FP32_FULL) and two distributions meant to resemble typical model activations (Transformer-like and Reasoning-like).
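For readers who want to poke at the idea, here is a minimal sketch of what block-scaled integer quantization could look like. It is not the code behind the results in this post. The assumptions are: MX-style uses blocks of 32 with a power-of-two scale, NV-style uses blocks of 16 with an E4M3-like 8-bit scale plus one per-tensor scale, the integer range is symmetric, and scales are rounded up so the block maximum is never clipped. All function names are hypothetical, and the exponent limits of real E8M0/E4M3 scales are only roughly approximated.

```python
import numpy as np

def _to_blocks(x, block):
    """Flatten, zero-pad to a multiple of `block`, reshape to (n_blocks, block)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    pad = (-x.size) % block
    return np.pad(x, (0, pad)).reshape(-1, block), x.size

def _round_up_mantissa(s, mant_bits=3):
    """Round positive values up to `mant_bits` explicit mantissa bits."""
    m, e = np.frexp(s)                      # s = m * 2**e with m in [0.5, 1)
    step = 2.0 ** (mant_bits + 1)
    return np.ldexp(np.ceil(m * step) / step, e)

def mx_int_roundtrip(x, bits=8, block=32):
    """MX-style (assumed): one power-of-two scale per block of 32."""
    qmax = 2 ** (bits - 1) - 1
    xb, n = _to_blocks(x, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)    # all-zero blocks get a dummy scale
    scale = np.exp2(np.ceil(np.log2(safe / qmax)))
    q = np.clip(np.rint(xb / scale), -qmax, qmax)   # integer codes
    return (q * scale).ravel()[:n]          # dequantized values

def nv_int_roundtrip(x, bits=8, block=16):
    """NV-style (assumed): E4M3-like scale per block of 16 plus a per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    xb, n = _to_blocks(x, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    t_amax = amax.max()
    # Per-tensor scale chosen so the largest block scale lands near 448 (E4M3 max).
    t_scale = t_amax / (qmax * 448.0) if t_amax > 0 else 1.0
    ideal = np.where(amax > 0, amax, 1.0) / (qmax * t_scale)
    # Crude stand-in for the E4M3 range: clamp to [2**-9, 448].
    b_scale = np.clip(_round_up_mantissa(ideal), 2.0 ** -9, 448.0)
    scale = b_scale * t_scale
    q = np.clip(np.rint(xb / scale), -qmax, qmax)
    return (q * scale).ravel()[:n]
```

Passing `bits=4` gives the INT4 variants, and the reconstruction error for any of them is `np.mean((x - x_hat) ** 2)`. The finer scale granularity (3 mantissa bits instead of a pure power of two) and the smaller block are the two places where the NV-style path can pick a tighter scale than the MX-style path.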
Test setup:
- Base precision: FP32 (main baseline).
- Distributions: FP32_FULL (= RangeProfile.FP32_IEEE; log-uniform sampled n=200,000, include_subnormals=False, max_abs_cap≈4.61e18), Transformer-like, Reasoning-like.
- Formats: MXFP4 / NVFP4 / MXINT4 / NVINT4 / MXINT8 / NVINT8
- Loops: 5 (results averaged). Timing is reported in ms/loop on CPU (NumPy reference; not hardware-optimized).
- Metrics:
  - Reconstruction error (MSE vs FP32)
  - Quantize / Dequant time (ms/loop)
  - Bits per value (and Total bits)
  - Nonlinear layer fidelity (GELU, LayerNorm, Softmax):
    - Softmax MSE, Softmax KL(p‖q), Softmax Top-1 delta rate (argmax change vs FP32)
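The three Softmax metrics in the list above can be sketched as follows. This is an illustration under assumptions: inputs are 2-D arrays of logits with one distribution per row (the row width is not stated in this post), KL is averaged over rows, and `softmax_metrics` and its `eps` parameter are hypothetical names.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_metrics(logits_ref, logits_q, eps=1e-12):
    """Compare softmax of FP32 reference logits (p) with softmax of
    quantize-dequantize logits (q). Returns (MSE, mean KL(p||q), top-1 delta rate)."""
    p = softmax(np.asarray(logits_ref, dtype=np.float64))
    q = softmax(np.asarray(logits_q, dtype=np.float64))
    mse = np.mean((p - q) ** 2)
    # eps guards against log(0) when a probability underflows.
    kl = np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))
    # Fraction of rows whose argmax differs from the reference.
    top1_delta = np.mean(p.argmax(axis=-1) != q.argmax(axis=-1))
    return mse, kl, top1_delta
```

The top-1 delta rate is the strictest of the three: a row counts as changed even when the two leading probabilities were nearly tied in the reference.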
Key Results
FP32_FULL (IEEE754 FP32 range, log-uniform sampled n=200,000)
| Scheme | MSE vs FP32 | bits/value | Total bits | Quantize ms/loop | Dequant ms/loop |
|---|---|---|---|---|---|
| MXFP4 | 1.013488e+33 | 4.25000 | 850000 | 3355.222306 | 128.279116 |
| NVFP4 | 9.001336e+32 | 4.50016 | 900032 | 569.166559 | 179.929585 |
| MXINT4 | 4.384233e+34 | 4.25000 | 850000 | 110.942443 | 47.343677 |
| NVINT4 | 8.583500e+32 | 4.50016 | 900032 | 281.392532 | 97.133989 |
| MXINT8 | 1.728744e+32 | 8.25000 | 1650000 | 110.809907 | 48.246607 |
| NVINT8 | 4.610962e+30 | 8.50016 | 1700032 | 285.039266 | 97.157737 |
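→ NVINT8 achieved the smallest reconstruction error:
roughly two orders of magnitude below NVFP4 and NVINT4 and about 37x below MXINT8, though at 8.5 bits/value rather than 4.5.
Its quantize/dequant times match NVINT4 (about 285 / 97 ms per loop): roughly 2.6x / 2x slower than the MX integer variants, but faster than both FP4 formats in this NumPy reference.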
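The bits/value and Total bits columns in the FP32_FULL table are consistent with a simple storage model: MX variants pay one 8-bit scale per block of 32, and NV variants pay one 8-bit scale per block of 16 plus a single 32-bit per-tensor scale. The sketch below reproduces the table's numbers under that assumption; `storage_bits` is a hypothetical helper, not part of the experiment code.

```python
def storage_bits(n, elem_bits, block, scale_bits, tensor_scale_bits=0):
    """Return (total_bits, bits_per_value) for n values stored in blocks."""
    n_blocks = -(-n // block)               # ceiling division
    total = n * elem_bits + n_blocks * scale_bits + tensor_scale_bits
    return total, total / n

n = 200_000
mx4 = storage_bits(n, 4, 32, 8)        # MXFP4 / MXINT4: 850000 bits, 4.25 bits/value
nv4 = storage_bits(n, 4, 16, 8, 32)    # NVFP4 / NVINT4: 900032 bits, 4.50016 bits/value
mx8 = storage_bits(n, 8, 32, 8)        # MXINT8: 1650000 bits, 8.25 bits/value
nv8 = storage_bits(n, 8, 16, 8, 32)    # NVINT8: 1700032 bits, 8.50016 bits/value
```

So the NV-style formats cost about 0.25 extra bits per value over their MX counterparts, and the per-tensor scale contributes only the trailing 0.00016 at this tensor size.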
Transformer-like (synthetic activations from GELU, LayerNorm, Softmax, n=200,000)
| Scheme | GELU MSE | LayerNorm MSE | Softmax MSE | Softmax KL(p‖q) | Softmax Top-1 Δ rate |
|---|---|---|---|---|---|
| MXFP4 | 0.015742 | 0.014989 | 0.000157 | 0.052713 | 0.224712 |
| NVFP4 | 0.008770 | 0.008159 | 0.000025 | 0.013377 | 0.099872 |
| MXINT4 | 0.023316 | 0.021552 | 0.000106 | 0.050271 | 0.199744 |
| NVINT4 | 0.010732 | 0.011094 | 0.000012 | 0.008405 | 0.065301 |
| MXINT8 | 0.006313 | 0.004819 | 0.000100 | 0.042190 | 0.119078 |
| NVINT8 | 0.000210 | 0.000139 | 0.000003 | 0.001137 | 0.021767 |
→ On Transformer-like distributions, NVINT8 has the lowest error on every metric:
GELU and LayerNorm MSE are around 1e-4 to 2e-4 and Softmax KL is about 0.001, although the Softmax argmax still changes in about 2.2% of cases.
Reasoning-like (heavy-tailed activations, block-scale jitter, n=200,000)
| Scheme | GELU MSE | LayerNorm MSE | Softmax MSE | Softmax KL(p‖q) | Softmax Top-1 Δ rate |
|---|---|---|---|---|---|
| MXFP4 | 0.674760 | 0.022027 | 0.001473 | 0.324865 | 0.289373 |
| NVFP4 | 0.217546 | 0.008232 | 0.000399 | 0.094912 | 0.096671 |
| MXINT4 | 0.725123 | 0.025214 | 0.000848 | 0.241636 | 0.165173 |
| NVINT4 | 0.335547 | 0.013386 | 0.000163 | 0.035941 | 0.064661 |
| MXINT8 | 0.374638 | 0.007906 | 0.000813 | 0.192708 | 0.132522 |
| NVINT8 | 0.005198 | 0.000145 | 0.000025 | 0.005392 | 0.018566 |
→ In reasoning-like (dynamic, scale-varying) contexts,
NVINT8 again has the lowest error by a wide margin:
GELU MSE is 0.0052 (vs 0.22 for NVFP4), LayerNorm MSE is 0.00015, Softmax KL is 0.0054, and the Softmax argmax changes in about 1.9% of cases.
Thoughts
It’s just a hypothesis —
maybe my implementation has mistakes, or the logic is flawed somewhere.
If NVINT8 truly existed, wouldn’t that be amazing?
FP32-like accuracy, INT8 efficiency — and almost no quantization loss across GELU, LayerNorm, and Softmax.
Notes
- Implemented in Python / NumPy, following MXFP4 and NVFP4 scaling behavior.
- All tests ran on CPU with n=200,000 samples per distribution (log-uniform over the FP32 range for FP32_FULL).
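For anyone who wants to reproduce the FP32_FULL input, log-uniform sampling over the FP32 range can be sketched as below. This is an assumption-laden illustration: it samples magnitudes uniformly in log2 between the smallest normal FP32 value (no subnormals) and the cap of about 4.61e18 quoted in the test setup, and picks signs at random. `sample_log_uniform_fp32` and its parameters are hypothetical names.

```python
import numpy as np

def sample_log_uniform_fp32(n=200_000, max_abs_cap=4.61e18, seed=0):
    """Draw n float32 values whose magnitudes are log-uniform in
    [smallest normal FP32, max_abs_cap), with random signs."""
    rng = np.random.default_rng(seed)
    lo = np.log2(float(np.finfo(np.float32).tiny))   # about 1.18e-38, exponent -126
    hi = np.log2(max_abs_cap)
    mag = np.exp2(rng.uniform(lo, hi, size=n))       # uniform in the exponent
    sign = rng.choice([-1.0, 1.0], size=n)
    return (sign * mag).astype(np.float32)
```

Because the exponent is uniform, every binade gets roughly equal weight, but an unweighted MSE over such data is dominated by the few largest magnitudes. That is why the FP32_FULL errors are on the order of 1e30 to 1e34.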
Discussion welcome
This is not a definitive result — just an experiment and a thought exercise.
I’d really appreciate any feedback, corrections, or insights from others who’ve worked with INT quantization kernels or TensorRT calibration.
Maybe there’s a bug.
Maybe it’s an accidental truth.
Either way — I’d love to hear your thoughts.
Summary
- NVFP4/NVINT4 already show lower error than their MX counterparts on every metric measured here.
- NVINT8 (hypothetical) has the lowest error in every test, with Softmax KL between about 0.001 and 0.005 and a top-1 change rate of about 2%.
- If real, it could be ideal for reasoning-heavy or LLM workloads.