Yes, it works quite well with vLLM as it supports Anthropic’s /messages endpoint. I mostly used it through my LiteLLM proxy though, as I have multiple endpoints converging there, so it’s more convenient.
Here is my helper script (claude_code_vllm.sh) that I can run with a model name as an argument, and it will work with local models served by vLLM:
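A minimal sketch of what such a launcher can look like, written in Python rather than shell, and not the script referenced above. It assumes vLLM is listening on `http://localhost:8000` (its default port), that the `claude` CLI is on `PATH`, and that Claude Code reads the `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, and `ANTHROPIC_MODEL` environment variables. The function names and the dummy token are hypothetical.

```python
import os
import subprocess

# Assumption: the vLLM server is running locally on its default port.
DEFAULT_BASE_URL = "http://localhost:8000"


def build_env(model, base_url=DEFAULT_BASE_URL, base_env=None):
    """Return a copy of the environment that points Claude Code at a local server."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    # Placeholder value: whether the token is checked depends on how the server is started.
    env["ANTHROPIC_AUTH_TOKEN"] = "local-dummy-token"
    env["ANTHROPIC_MODEL"] = model
    return env


def launch(model, extra_args=()):
    """Start the `claude` CLI against the local endpoint and return its exit code."""
    return subprocess.run(["claude", *extra_args], env=build_env(model)).returncode
```

Calling `launch("my-model")` would start a session. The model name has to match the name vLLM serves the model under (for example the value passed to `--served-model-name`).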
Working on my config, but here’s where I’m currently at:
A single Spark almost makes this impossible. llama.cpp is nice for the speed but produces far too many tool-calling errors. vLLM is what you want to run, but then you can’t use the UD-Q3 quants of the models that are worthwhile (I still don’t understand how folks are getting away with gpt-oss here). I had to pick up a second Spark, and then things started to click.
The dual setup lets you load up a worthwhile model with enough memory for 5+ concurrent caches. This is really where the setup shines.
Llama-swap for routing. This took some setup to make dynamic with the Ray cluster, especially since current vision models require a different image. Not elegant and this will need to be updated as things progress.
OpenCode as the platform. OpenSpec for task management.
I did an 80/20 on the Confucius paper to figure out a bare-bones scaffold. An exorbitant number of hours later, plus the meta agent, I’m sort of learning how to write system prompts. I’m also finding that slash commands are very useful for common patterns, tasks, and architecture setup.
Creating namespaced subdomains within the org chart. Currently have a pretty decent research and code arrangement. My VLM-backed OCR pipeline needs more love.
Using Docker’s MCP-gateway as much as possible as the proxy for MCP servers. Obviously the servers are dependent on your needs.
Have a very customized SearXNG instance plus an MCP server for it. This, alongside MarkItDown, is almost mandatory.
Have also messed around with Playwright but need to get further here. For now that one stays turned off, as it isn’t a direct fit for the projects I’m focused on.
Need to implement CI/CD with my Gitea instance as a check gate and form of feedback for the agents.
Still a lot to learn and figure out, but I can say that so far I’ve been able to walk away (sort of), let the namespaced managers handle their business, and ~2hrs later find (broken) features built.
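For the point above about having enough memory for 5+ concurrent caches, a back-of-the-envelope estimate helps when deciding whether a model fits. This is a sketch that assumes standard attention with grouped KV heads and a 16-bit cache. The model shape below is hypothetical and does not describe any specific model from this thread; hybrid or quantized-cache models will differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * seq_len


# Hypothetical model shape, chosen only to make the arithmetic concrete.
one_session = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=32_768)
five_sessions = 5 * one_session

print(f"one session:   {one_session / 2**30:.1f} GiB")
print(f"five sessions: {five_sessions / 2**30:.1f} GiB")
```

With these made-up numbers, one 32K-token session needs 6 GiB and five need 30 GiB, on top of the weights.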
I just noticed on the Unsloth site that they claim to have fixed the issues with Qwen3 Coder and tool use, which might be useful for folks on this thread. “You can now use tool-calling seamlessly in llama.cpp, Ollama, LMStudio, Open WebUI, Jan etc. This issue was universal and affected all uploads (not just Unsloth), and we’ve communicated with the Qwen team about our fixes!”
I started reading from the first post, and a lot has changed in the last six months. I am currently working on a personal project that involves vibe coding on a single DGX Spark to build an LLM inference engine, and the experience has definitely improved compared to before. In my experience, Gemma4-31B is excellent at understanding user intent and drafting plans, while Qwen3.6-27B is stronger at code generation. The drawback is that, being a dense model, it is slow, but using MTP with llama.cpp has made the performance manageable.
I ran Gemma4 using vLLM within Docker. For the options, I just went with the common configurations used here on this forum.
As for Qwen3.6, I’m currently using llama.cpp as follows. It might not be perfectly optimized, but it works well enough for my needs, so I’m sticking with it.
I build it by pulling the llama.cpp branch 22673 like this:
My agent environment isn’t anything fancy. When using Gemma4, I integrated it with the Zed editor, and lately, for Qwen3.6, I’ve been using VS Code with the Cline extension. For me, the model’s ability to perform tasks consistently is more important than any specific app or agent. To achieve this, whenever I start a new session (in Zed) or task (in VS Code), I provide the AI with a “start keyword.” This prompts it to read a series of documents that act as a sort of “boot sequence,” allowing it to pick up where it left off and maintain continuous progress on the project.
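A minimal sketch of such a boot sequence, assuming the documents live as Markdown files in the project root. The file names and the wording of the header are hypothetical; the real rulebooks and logs will differ per project.

```python
from pathlib import Path

# Hypothetical file names, in the order the agent should read them.
BOOT_DOCS = ["RULES.md", "ARCHITECTURE.md", "SESSION_LOG.md"]


def build_boot_prompt(project_dir, doc_names=BOOT_DOCS):
    """Concatenate the boot documents that exist into one start-of-session prompt."""
    sections = []
    for name in doc_names:
        path = Path(project_dir) / name
        if not path.is_file():
            continue  # skip documents the project has not created yet
        text = path.read_text(encoding="utf-8").strip()
        sections.append(f"## {name}\n{text}")
    header = "Read the following documents in order, then continue from the last logged step."
    return "\n\n".join([header, *sections])
```

The "start keyword" then only has to trigger this read. Keeping the order fixed (rules first, latest log last) means the most recent state sits closest to the new task.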
This is super interesting! So when a task finishes, it leaves an artifact behind that the next session picks up in its boot sequence? How much detail do you put in there?
I’ve found managing memory and state to be a challenge, so I’m always interested in other people’s approaches. :)
This is so cool!! Thanks for nerding out. So on a new session, the agent reads this file (along with all the updated guides) as its boot sequence, correct? Do you find gaps in this, or is it pretty solid? I’m still tinkering and thinking through how to juggle context for the long-term goal vs. the short-term task.
Also, do you find the local LLM (Qwen 3.6 27B Q8 in this case) capable of staying on rails, making good code, etc? Is it the infra you create that matters the most? Or the LLM’s quality?
I’m trying to go all local and am figuring out how best to approach it, as I’m just starting out.
With num_speculative_tokens = 1 there are no errors, but the generation speed is not so pleasant.
With num_speculative_tokens > 1 I get a lot of ‘Unable to parse JSON string’ errors while working with Kubernetes manifests:
(APIServer pid=116) ERROR 05-10 21:21:27 [serving.py:1371] Error in chat completion stream generator.
(APIServer pid=116) ERROR 05-10 21:21:27 [serving.py:1371] Traceback (most recent call last):
(APIServer pid=116) ERROR 05-10 21:21:27 [serving.py:1371]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 1253, in chat_completion_stream_generator
(APIServer pid=116) ERROR 05-10 21:21:27 [serving.py:1371]     actual_call = tool_parser.streamed_args_for_tool[index]
(APIServer pid=116) ERROR 05-10 21:21:27 [serving.py:1371]                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
(APIServer pid=116) ERROR 05-10 21:21:27 [serving.py:1371] IndexError: list index out of range
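The traceback ends in an `IndexError` on `tool_parser.streamed_args_for_tool[index]`: the stream generator asked for a tool call at a position the parser’s list does not contain. Why that happens with more speculative tokens is not established here. The following is only a standalone sketch of the failure mode and of a bounds-checked access, not vLLM’s code; the helper name and the sample tool name are hypothetical.

```python
def get_streamed_args(streamed_args_for_tool, index):
    """Return the argument text streamed so far for tool call `index`, or None if absent."""
    if 0 <= index < len(streamed_args_for_tool):
        return streamed_args_for_tool[index]
    return None


streamed = ['{"name": "kubectl_apply"']  # only one tool call has been tracked so far

try:
    streamed[1]  # same pattern as the traceback: indexing a call that was never recorded
except IndexError as exc:
    print(f"unguarded access fails: {exc}")

print(get_streamed_args(streamed, 1))  # guarded access returns None instead of raising
```

A guard like this would only hide the symptom. The underlying question is why the parser’s bookkeeping and the stream index disagree, which is worth reporting upstream along with the model, parser, and speculative settings used.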
Exactly. After terminating a session, I follow a specific startup procedure to have the AI review the core rulebooks, foundational docs, and logs from the previous session. This works quite well for carrying over tasks into a new session, and the AI is even able to automatically update the necessary documentation itself.
However, there are significant drawbacks. First, as the project grows, the documentation required for this startup sequence also expands. Consequently, a massive amount of context is consumed just to get the session running.
Second, despite these precautions, the model begins to ‘forget’ the initial instructions as the session progresses. When combined with high context usage, this leads to severe hallucinations and repetitive mistakes. If you look at my logs, you’ll see the word ‘surgical’ being used repeatedly—even though I explicitly instructed it not to overuse that term. At that point, my only option is a hard reset of the session.
Lastly, this method doesn’t guarantee long-term stability across subsequent sessions. If the AI takes shortcuts instead of following best practices to solve an immediate problem, it eventually causes issues that surface several sessions later, dragging everything back to square one. In my termination logs, you’ll notice some phases starting with ‘ex’ rather than numbers; this is a direct result of having had to reset and repeat those specific phases multiple times.
In short, while my method might work for short-term tasks, it’s not sustainable for long-term projects. Since this is just a hobby, I don’t mind, but if this were professional work, I would have quit long ago. Even my goal of building an LLM inference engine has been stuck in repeated loops, particularly during the core GEMM implementation phase.
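The first drawback, the context consumed by the startup documents, can at least be tracked. Below is a rough sketch that assumes about four characters per token, which is only a heuristic; real counts depend on the model’s tokenizer. The document sizes and the context window are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; real counts depend on the model's tokenizer."""
    return int(len(text) / chars_per_token)


def boot_budget(doc_texts, context_window):
    """Return (estimated boot tokens, fraction of the context window they use)."""
    used = sum(estimate_tokens(t) for t in doc_texts.values())
    return used, used / context_window


# Hypothetical document sizes and context window, for illustration only.
docs = {"RULES.md": "x" * 40_000, "SESSION_LOG.md": "x" * 120_000}
used, fraction = boot_budget(docs, context_window=131_072)
print(f"boot sequence: ~{used} tokens, {fraction:.0%} of the window")
```

Running a check like this before each session gives an early signal for when the logs need to be summarized or archived, rather than finding out through degraded output mid-session.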