Have you read your own documentation? I have – it literally recommends --max-seq-len 2048.
Your GitHub README should be the single source of truth. Otherwise add a warning: “Don’t trust any information published here; it’s likely out of date and unreliable.” It can’t be both.
You posted here asking for help testing. I tested in good faith, but you expect me to scour your forums to figure out why the recipe you published in the quick start guide doesn’t work, and even that is too much to ask. Like I said, this is my minimum-effort expectation: update your README before you ask for help.
It’s not like you don’t have a history of this, going back to “We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ!” I wasted a good week on the previous incarnation for the same reason.
And let me be clear: a Rust vLLM is a great idea. The cold start is really nice. But why focus on 100 t/s on a 35B model? I get that on vLLM now. What I want is models that do real work, not toys – 4k context feels like a toy. If that is your intent, make it very clear.
What would really impress me is if you supported Rob’s work on @tenari PrismaQuant v2 NVFP4. That would close the circle.
From your current quick start guide
Your documentation


