So when I first set up my machine, --kv-cache-dtype fp8 was the default in all my recipes. I had no understanding that bf16 was an option worth exploration. The discussions here, no doubt influenced by the Nvidia DGX Spark marketing was that we should be aiming for --kv-cache-dtype NVFP4. Then I had a discussion with @eugr and @flash3 about all the tool call failures I was experiencing and the eventual outcome was a suggestion to use bfloat16 with int4 AutoRound (which I had never heard of) with Qwen3 Coder Next. This was long before AutoRound had gained popularity. At the time vllm support for this configuration seemed nascent.
Back in Feb the dominant conversations on this forum was almost exclusively focused on t/s. I found this a real time sink because almost everything I tried couldnât sustain long contexts, instruction following and tool calls. Then Qwen3.5 dropped and while I was attempting to quantise it to AutoRound myself, Intel released their versions in short succession.
It takes a lot of time to download models, wait 10 minutes for vllm to startup, run a 30 min process, tweak a setting, wait 10 minutes for vllm to startup again, re-run a 30 min process, compare results - rinse and repeat.
It may seem perfunctory now to be discussing these settings, but back in Feb this really wasnât a conversation that was being had â at least not one visible to me. I had clients who needed deliverables and I was lost down endless rabbit holes getting nowhere â mad how fast moving this has been.
I started this thread because I figured other members might also be tired of wasting time and getting nowhere too. I have learned a lot from everyone in the meantime. I have a daily driver setup now that I am satisfied with, and still keenly experimenting with PrismaQuant v2 â so I guess a foot in each camp ;)
@whpthomas for tool calling have you ever switched between chat-template-content-format string and openai and tried around with that? Just noticed that with the latest qwen-3.6-enhanced vLLM auto detects string mode, whereas before it was using openai mode which hade more problems during tool calling for me.
This type of posts is why in the IT industry I donât like when people say âBest practiceâ. I encourage everyone to say âLead Practiceâ instead.
Your example is the best one âAt the time, the lead practice was to run Bfloat16â, everything evolves quickly! Now the lead practice might be different :)
Also, Iâm learning that everyoneâs workflow is different around here, the recipes that worked the best for me were yours, but in my case (single user) tok/s are more important than in your case that I saw processing 15-20 parallel PDFs where you can benefit of parallelism.
LLMs are non-deterministic in content, response time, in basically everything. the structure of this jelly is being tinkered with on a weekly basis â thereâs new DeltaNets and whatever else, which everyone then has to master overnight because itâs the hip thing right now. space problems at every corner, quantization breaks even the tooling, itâs like sticking a structurally stable cookie into the jelly and expecting it to wobble less afterwards.
DeepJelly recently dropped 1.6T of jelly, nobody knows where to put it. eating it normally isnât an option. and over in the diet section, which is exactly where we are, people are thrilled they can even manage to nibble a bit of jelly off the tiniest fork. of course recipes get swapped. survival tips. nutshell at Cape Horn. might work out. but the attrition is high, since plenty of assumptions donât pan out.
the jelly doesnât make you smart. it only works when itâs corseted by tooling and system prompts. the best jelly is the one you donât have to talk to, because otherwise you might get annoyed all over again. the life-time budget being burned through here is approaching the Guinness record for best phantom productivity ever. and thatâs before we even get to the perception distortion, because some folks already use the jelly to button up their pants.
the question is: do I think it through myself for a moment, or do I ask the jelly? and how do I know the jelly is right if I havenât thought it through myself beforehand. better to ask first â then new ideas come up.
⊠and maybe Iâll tune it a bit more so it answers faster. but is it answering correctly then? for that Iâd really need to be sure and ask the untuned version again beforehand â but was that one even right to begin with?
Personally I mostly use SDX Spark for coding, what I am coding also uses DGX Spark for inference. Ultimately thats applied research into agentic patterns that work reliably for air-gapped business automations. So long context, long running, instruction following, tool calling and multi-modal.
Henry, I am trying to run your latest recipe with latest Eugr setup, and it crashes something hard. Any chance you have been doing some updates/changes not captured in this thread ? I am on the latest OS release. Thanks
⊠have been rigorously tested on concurrent workloads. Context rot sets in at about 130k so you donât need more then 196K. 32768 batched token is marginally faster but prone to OOM errors when saturated. 16 seqs in combination with 16384 runs the cache at about 86% leaving headroom for surges.
You can push hard all day with this setup reliably on a single DGX spark.
From the documentation, vLLM basically converts the input to the openai format, if that format is used, otherwise it just passes on the string from the LLM
The format to render message content within a chat template. * âstringâ will render the content as a string. Example: âHello Worldâ * âopenaiâ will render the content as a list of dictionaries, similar to OpenAI schema. Example: [{âtypeâ: âtextâ, âtextâ: âHello world!â}]
Ok I understand now, the difference between JSON format and markdown, I didnât realised the term for this was chat-template-content-format but that makes perfect sense now. So do coding harnesses like OpenCode have to be configured for this or do they detect it? I know there is also --tool-call-parser qwen3_xml but I could never get that to work, so I just assumed it was a protocol, the harness either expects one or the other, but I never looked any deeper into it.
I havent done much testing with this difference yet, I just noticed it in the startup logs when I switched to the qwen-3.6 enhanced template, for the other templates it usually selected the openai mode automatically. In any case this template did fix nearly all tool calling issues for me. Not sure if the template option has anything to do with it as well yet :)
Weirdly I was using the 3.6 enhanced template up until last week, and after updating spark-vllm-docker everything broke. I went through disabling settings one by one and removing the Qwen-3.6 enhanced template fixed my tool calls. So I just ran with that for the time being. A lot of this for me is trial and error. If it works, I try to leave it alone for as long as I can. Then Rob released PrimaQuant v2 and I updated and everything works slightly differently. I know I should probably save different docker images and pair them with specific models, but there have been so many great improvements lately I keep rolling the dice.
I am also usually going this route, though I just number the new templates and cycle over the old images once I am no longer using them, its a habbit that formed in the early days, when you had to compile everything which took forever and stuff broke a lot more often. So I just have a couple ie vllm-node-tf5-X images non tf5 and some special custom PR templates to run stuff like google MTP and so on :)
Hey, folks! After extensive testing of MTP, I saw a drawback in setting num_speculative_tokens more than 1 - beacuse it breaks tool usage. Got a lot of opencode errors like âExpected âfunction.nameâ to be a string.â which break the loop and model stops. May be you know a workaround to such behavior?