Originally published at: Introducing NVFP4 for Efficient and Accurate Low-Precision Inference | NVIDIA Technical Blog
To get the most out of AI, optimizations are critical. When developers think about optimizing AI models for inference, model compression techniques—such as quantization, distillation, and pruning—typically come to mind. Of the three, quantization is without a doubt the most common, typically because it preserves task-specific accuracy after optimization and offers a broad choice of…
vLLM still doesn't fully support NVFP4. I'm unable to run NVFP4/Qwen3-Coder-30B-A3B-Instruct-FP4 on my DGX Spark using the nightly vLLM image.
Why is there a sign bit on the scale factor? It seems redundant.
Please don't add unnecessary animation to your images. There's a place for such things, but not here. The images already have left-to-right flow; having parts disappear and reappear greatly hampers reading them.