I agree, and I’m not saying the metrics aren’t useful. But without a good measure of accuracy, it’s hard to know at what cost the improved speed comes, so it’s a difficult decision to make.
I haven’t used these models and I’m not suggesting the NVFP4 one is better. I’m just saying that without good data about accuracy, it’s not obvious to me that the fastest one is “better” (at least for my priorities).
Yeah, I absolutely get this (I’ve tried many times to come up with something to help me compare things), and I hope we can come up with something. But in the meantime, I’m just not putting much weight on speed. From the testing I’ve done so far, if you asked me if I wanted a faster model or a smarter model, I would pick smarter every time. There is no model I’ve run that is smart enough, even at a pathetically slow speed 😄
It might be worth trying Nemotron-3-super since it’s several times faster than other models. At that speed, performing manual corrections becomes a much more acceptable trade-off.
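To put rough numbers on that trade-off, here is a minimal sketch. It assumes a very simple model: expected time per task is generation time plus manual-fix time weighted by how often the output is wrong. The function name and every number below are hypothetical, not measurements of any model in this thread.

```python
def expected_minutes_per_task(gen_minutes: float, accuracy: float, fix_minutes: float) -> float:
    """Expected wall-clock minutes per task.

    Generation time plus manual-correction time, weighted by the
    fraction of outputs that need fixing (1 - accuracy).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return gen_minutes + (1.0 - accuracy) * fix_minutes


# Hypothetical numbers, for illustration only.
fast = expected_minutes_per_task(gen_minutes=1.0, accuracy=0.70, fix_minutes=10.0)
smart = expected_minutes_per_task(gen_minutes=4.0, accuracy=0.90, fix_minutes=10.0)
print(f"fast model:  {fast:.1f} min/task")
print(f"smart model: {smart:.1f} min/task")
```

With these made-up inputs the faster model comes out ahead, but if a manual fix costs 30 minutes instead of 10 the ordering flips. So the answer depends on accuracy data nobody in this thread has yet.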
I didn’t try Nemotron-3-super because I saw “vLLM for it still in WIP”, and I am in a hurry to use a better model than the one I am currently using.
But the model could only handle very simple prompts and kept crashing. When I lowered max-length to 8k, I was able to run somewhat more “real” questions, but it couldn’t stay up long enough to be used daily.
I have two Blackwell 6000 Pro units, but I’m thinking about the same thing.
Currently, all of these model options are available:
- Qwen/Qwen3.5-122B-A10B-FP8
- Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
- Sehyo/Qwen3.5-122B-A10B-NVFP4
- RedHatAI/Qwen3.5-122B-A10B-NVFP4
- Intel/Qwen3.5-122B-A10B-int4-AutoRound
- QuantTrio/Qwen3.5-122B-A10B-AWQ
So, for a production-grade deployment on Blackwell serving around 50 concurrent users, which option provides the best combination of stability and performance?
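One way to compare the candidates on stability and performance at that concurrency is a small load-test harness. The sketch below is only that: `run_load_test` and `fake_request` are hypothetical names, and `fake_request` is a stand-in you would replace with a real call to your own serving endpoint. It keeps at most 50 requests in flight and reports failures and latency percentiles.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable


async def run_load_test(send_request: Callable[[int], Awaitable[None]],
                        concurrency: int = 50,
                        total_requests: int = 200) -> dict:
    """Run total_requests calls with at most `concurrency` in flight.

    Returns success/failure counts and latency stats for the successes.
    """
    semaphore = asyncio.Semaphore(concurrency)
    latencies: list[float] = []
    failures = 0

    async def one(i: int) -> None:
        nonlocal failures
        async with semaphore:
            start = time.perf_counter()
            try:
                await send_request(i)
            except Exception:
                # Count any error (timeout, crash, HTTP failure) as a failure.
                failures += 1
                return
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(total_requests)))
    wall = time.perf_counter() - wall_start

    latencies.sort()
    return {
        "ok": len(latencies),
        "failed": failures,
        "p50_s": statistics.median(latencies) if latencies else None,
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))] if latencies else None,
        "requests_per_s": len(latencies) / wall if wall > 0 else None,
    }


async def fake_request(i: int) -> None:
    """Hypothetical stand-in for a real request to the model server."""
    await asyncio.sleep(0.01)
    if i % 50 == 49:
        raise RuntimeError("simulated server error")


if __name__ == "__main__":
    print(asyncio.run(run_load_test(fake_request, concurrency=50, total_requests=200)))
```

Running the same harness against each quantization for an extended period, with prompts that resemble real traffic, would show crash and error rates alongside speed. It says nothing about answer quality, which still needs a separate evaluation.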
It seems the model perceives the image in the wrong orientation, either upside down or flipped on the diagonal. I’ve repeatedly seen a lower-right part of the image perceived by the model as being at the top-left.
Are there any image/vision configurations that vLLM needs?
It is tax time. When using these models for tax work, I benchmarked a few of them on converting financial statements from PDF to MD and then to QIF (a format financial apps can import). Nemotron was the clear winner among all the models I benchmarked, both speed-wise and quality-wise. In the MD → QIF work, Nemotron-3-super took 1/4 of the time Gemma4-31b took…
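For anyone who wants a known-good reference file to score a model's MD → QIF output against, here is a minimal sketch of the QIF bank-record layout (`D` date, `T` amount, `P` payee, `M` memo, `^` end of record). The `Transaction` class and `to_qif` function are hypothetical helpers, not part of the benchmark described above, and the US-style `MM/DD/YYYY` date is an assumption; check what your financial app expects.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class Transaction:
    day: date
    amount: Decimal  # negative for money leaving the account
    payee: str
    memo: str = ""


def to_qif(transactions: list[Transaction], account_type: str = "Bank") -> str:
    """Serialize transactions as QIF text with one header line."""
    lines = [f"!Type:{account_type}"]
    for t in transactions:
        lines.append(f"D{t.day.strftime('%m/%d/%Y')}")  # assumed US date format
        lines.append(f"T{t.amount:.2f}")
        lines.append(f"P{t.payee}")
        if t.memo:
            lines.append(f"M{t.memo}")
        lines.append("^")  # end-of-record marker
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [Transaction(date(2025, 1, 15), Decimal("-42.5"), "Grocery Store", "weekly")]
    print(to_qif(sample), end="")
```

Using `Decimal` rather than `float` keeps amounts exact, which matters when diffing a model's output against the reference.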