I should have been more explicit. The name of the model is wrong.
Both would be nice, but automatic deployment would be preferable - that's what I meant by the mode or option. E.g. could you add a recipe option, such as --litellm, so that the recipe would also set up a LiteLLM proxy?
When configuring LiteLLM manually, for example, I initially specified only one model (Opus). But even though I selected Opus in Claude Code, Haiku and Sonnet were called too, and those calls failed because I had not specified them. So you always need to include all three.
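For reference, a minimal sketch of what such a LiteLLM `config.yaml` might look like. The alias names, model ID, and backend URL here are placeholders from my assumptions, not required values - the point is only that all three tiers Claude Code calls need an entry, even if they all route to the same local backend:

```yaml
model_list:
  # All three aliases point at the same OpenAI-compatible local backend;
  # without the haiku/sonnet entries those background calls fail.
  - model_name: local-opus            # placeholder alias
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-80B-FP8   # example model
      api_base: http://localhost:8000/v1            # assumed vLLM endpoint
      api_key: none
  - model_name: local-sonnet          # placeholder alias
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-80B-FP8
      api_base: http://localhost:8000/v1
      api_key: none
  - model_name: local-haiku           # placeholder alias
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-Next-80B-FP8
      api_base: http://localhost:8000/v1
      api_key: none
```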
You could try my little helper for LiteLLM updates.
It was intended to run from a Docker Compose file together with the vLLM/llama.cpp process. Each time you start the whole stack, it waits until vLLM is up and then sends an update to a running LiteLLM instance in database mode. It terminates after the change.
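In case it helps, here is a rough sketch of the shape of such a helper (the `/health` and `/model/new` endpoint paths and the payload fields reflect my setup and are assumptions; adjust to yours):

```shell
#!/bin/sh
# One-shot sidecar sketch: wait for vLLM, register the model with a
# LiteLLM proxy running in database mode, then exit.

# Build the JSON body for LiteLLM's /model/new admin endpoint.
# $1 = model name, $2 = backend api_base (both placeholders).
payload() {
  cat <<EOF
{"model_name": "$1", "litellm_params": {"model": "openai/$1", "api_base": "$2", "api_key": "none"}}
EOF
}

# Block until the vLLM health endpoint answers. $1 = vLLM base URL.
wait_for_vllm() {
  until curl -sf "$1/health" >/dev/null; do
    echo "waiting for vLLM at $1 ..." >&2
    sleep 2
  done
}

# Register the model and terminate.
# $1 = LiteLLM URL, $2 = master key, $3 = model name, $4 = backend api_base.
register_model() {
  curl -sf -X POST "$1/model/new" \
    -H "Authorization: Bearer $2" \
    -H "Content-Type: application/json" \
    -d "$(payload "$3" "$4")"
}
```

From the compose entrypoint you would then run something like `wait_for_vllm http://vllm:8000 && register_model http://litellm:4000 "$LITELLM_MASTER_KEY" my-model http://vllm:8000/v1` (hostnames and model name are placeholders).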
BTW, you don't need to do any of that. Just set environment variables for your client; then all you need is a /messages-compatible endpoint. E.g. I have this helper script on my Mac:
#!/bin/bash
# Usage: claude_local.sh <model> [base-url]
# Points Claude Code at a local /messages-compatible endpoint.
DEFAULT_BASE_URL="http://spark:8888"

export LITELLM_API_KEY="none"
export ANTHROPIC_BASE_URL="${2:-$DEFAULT_BASE_URL}"
export ANTHROPIC_AUTH_TOKEN="$LITELLM_API_KEY"

# Route every model tier to the same model passed as the first argument.
export ANTHROPIC_MODEL="$1"
export ANTHROPIC_SMALL_FAST_MODEL="$1"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="$1"
export ANTHROPIC_DEFAULT_SONNET_MODEL="$1"
export ANTHROPIC_DEFAULT_OPUS_MODEL="$1"
export CLAUDE_CODE_ATTRIBUTION_HEADER=0

claude
I can call it like this:
~/claude_local.sh Qwen/Qwen3-Coder-Next-80B-FP8
And it will work. Or like this if I want to go through the LiteLLM proxy:
~/claude_local.sh Qwen/Qwen3-Coder-Next-80B-FP8 https://llm-proxy:4000
I've followed your lead and set up a simple LiteLLM proxy, and I'm currently running it with Qwen3.5-35B-A3B-FP8 just fine! I've also created a pull request for a new recipe, allowing others to run it easily too.