Claude Code + VLLM on nvcr.io/nvidia/vllm

ignacioadriancabrera · June 28, 2026, 12:49am

Hi everyone,

I would like to share an open-source project I created to make it easy to use Claude Code with a local or LAN-hosted vLLM server:

Repository: Nachetecabrera/claude-vllm

The project provides a lightweight compatibility proxy between Claude Code and vLLM, including vLLM servers running locally, on another machine in the LAN, or through NVIDIA NGC-based deployments.

Simple installation on Windows, Linux and macOS

A major goal of the project is to avoid complicated manual configuration.

The repository includes native installation scripts for all three major desktop platforms.

Windows

.\install.ps1

Linux and macOS

chmod +x install.sh && ./install.sh

The installation process automatically:

Adds the project commands to the system PATH.
Installs and configures both proxy implementations.
Creates the claudevllm and claudevllmd commands.
Creates the spcvp and spcvn proxy startup commands.
Supports both the Python/FastAPI and Node.js/Fastify versions.

After restarting the terminal, the same commands can be used on Windows, Linux and macOS.

Basic configuration

The user only needs to copy the example configuration file and specify the vLLM server address.

For example:

{
  "listen_ip": "0.0.0.0",
  "listen_port": 8010,
  "forward_url": "http://VLLM_HOST:8000",
  "model": "qwen3-coder-next",
  "log_level": "INFO",
  "system_mode": "hoist",
  "drop_top_level_fields": "context_management,output_config,thinking",
  "drop_tool_fields": "strict,defer_loading",
  "strip_cache_control": true
}

The important values are:

forward_url: address of the local, LAN, VPN, Tailscale, or NGC vLLM server.
model: model name that the proxy should send to vLLM.
system_mode: how Claude Code system and developer messages should be normalized.

Claude Code can then be pointed to the local proxy using environment variables.

Windows PowerShell

$env:ANTHROPIC_BASE_URL = "http://localhost:8010"
$env:ANTHROPIC_API_KEY = "dummy"

Linux and macOS

export ANTHROPIC_BASE_URL="http://localhost:8010"
export ANTHROPIC_API_KEY="dummy"

These configuration options and commands are documented in the repository’s quick-start section.

Starting the proxy

Once installed, the user does not need to navigate through directories or manually launch Python or Node.js files.

The following global commands are available:

spcvp

Starts the Python/FastAPI proxy.

spcvn

Starts the Node.js/Fastify proxy.

Then Claude Code can be launched through:

claudevllm

Or with permission bypass mode:

claudevllmd

The README specifies that these commands work across Windows, Linux, and macOS after installation.

Why the proxy is useful

Some Claude Code requests are not accepted directly by certain versions of vLLM’s Anthropic-compatible endpoint.

For example, Claude Code may include system or developer roles inside the messages array, while the endpoint expects message roles to be only user or assistant.

This can produce validation errors such as:

Input should be 'user' or 'assistant'

Claude Code may also include fields such as:

context_management
output_config
thinking
cache_control

The proxy normalizes these requests before forwarding them to vLLM.

It can:

Move system and developer messages into the top-level system field.
Convert system messages into tagged user messages when required.
Remove unsupported top-level request fields.
Remove unsupported properties from tool definitions.
Recursively remove cache_control.
Force a specific model name.
Forward an upstream API key.
Preserve streaming responses.
Connect to a vLLM server running locally or elsewhere on the network.

The two supported system-message modes are hoist, which is recommended in the README, and user, which converts instructions to messages tagged with <system-update>.

Example architecture

Windows / Linux / macOS client
┌───────────────────────────────┐
│ Claude Code                   │
│ claude-vllm proxy             │
│ Python or Node.js             │
└───────────────┬───────────────┘
                │
                │ LAN / VPN / Tailscale
                ▼
┌───────────────────────────────┐
│ NVIDIA system                 │
│ vLLM server                   │
│ Local coding model            │
└───────────────────────────────┘

This makes it possible to run Claude Code on a regular Windows, Linux, or macOS computer while the model runs on a separate NVIDIA GPU workstation, server, or DGX system.

Health checks

The proxy also includes health endpoints:

curl http://localhost:8010/healthz

This verifies that the proxy is running.

curl http://localhost:8010/readyz

This verifies both the proxy and its connection to the upstream vLLM server.

Project scope

This is a compatibility proxy rather than a complete Anthropic-to-OpenAI translation layer.

It uses vLLM’s Anthropic-compatible interface and normalizes the portions of Claude Code requests that may otherwise cause validation errors.

The repository includes:

Windows, Linux, and macOS installers.
Python/FastAPI implementation.
Node.js/Fastify implementation.
Global startup commands.
Configurable request normalization.
Local and LAN vLLM support.
Health and readiness endpoints.
Environment-variable configuration.
MIT license.

Feedback, testing, issues, and pull requests are welcome, especially from users running:

NVIDIA DGX systems.
NVIDIA GPU workstations.
NVIDIA NGC vLLM containers.
Remote vLLM instances over LAN, VPN, or Tailscale.
Qwen, Nemotron, DeepSeek, and other coding models.
Claude Code clients on Windows, Linux, or macOS.

Repository: github.com/Nachetecabrera/claude-vllm

Zambonilli · June 28, 2026, 1:39am

Why not just submit PRs to fix the missing or broken pieces of the already existing anthropic messages API?

Topic		Replies	Views
Running a Local LLM with Claude Code and llama.cpp on Jetson Thor and RTX 5090 Jetson Thor llama , agentic-ai	1	3039	March 30, 2026
Claude Code >= 2.1.154 breaks with vLLM Anthropic-compatible endpoint on DGX Spark DGX Spark / GB10	2	1204	June 1, 2026
Docker Image: NVIDIA vLLM 0.23.0 with Claude Code 2.1.195+ Compatibility DGX Spark / GB10 spark , dgx	6	267	June 29, 2026
LangChain ChatAnthropic DGX Spark / GB10	1	177	June 15, 2026
Claude Code >= 2.1.154 Compatibility Issue Fixed in vLLM v0.23.0 (PR #44283) DGX Spark / GB10	2	249	June 16, 2026
Implementation Guide: DGX Spark with Qwen3.5-35B-A3B via llama.cpp for Claude Code DGX Spark / GB10 Projects llama , agentic-ai	3	1852	April 2, 2026
SparkRun Auto Model Registration with LiteLLM & Local Claude Code Setup DGX Spark / GB10 Projects inference-fil-spark , spark , nemotron	4	816	May 21, 2026
Local-first coding agent that auto-configures llama.cpp for maximum hardware performance Jetson Projects llm , llama , agentic-ai	0	540	April 13, 2026
vLLM returns 400 error for tool_choice="auto" when called from OpenClaw (Qwen3.5-35B on NVIDIA Spark GB10) DGX Spark / GB10 tools , docker , gpu-computing	2	1845	March 9, 2026
Total nightmare : NEMOCLAW over Paperclip over OPENCLAW over vLLM over Dokers, over LLM flavours , over Linux DGX Spark / GB10	14	3761	March 25, 2026

Claude Code + VLLM on nvcr.io/nvidia/vllm

Simple installation on Windows, Linux and macOS

Windows

Linux and macOS

Basic configuration

Windows PowerShell

Linux and macOS

Starting the proxy

Why the proxy is useful

Example architecture

Health checks

Project scope

Related topics