[GB10] vLLM + DeepSeek-R1-32B Stable Setup on Blackwell — Full Protocol After 4 Days of Failures

Hey everyone,

I spent 4 days getting vLLM stable on a Blackwell GB10 (120GB VRAM) running DeepSeek-R1-Distill-Qwen-32B. Most of what broke is not documented anywhere — so I’m sharing the full failure log here and in the repo linked below.

Hardware: NVIDIA Blackwell GB10 | 120GB VRAM | aarch64 (sbsa-linux)
Model: DeepSeek-R1-Distill-Qwen-32B (bfloat16)
Engine: vLLM compiled from source
OS: Linux aarch64


What Broke (In Order)

FAIL-001 — PyTorch reports cuda.is_available() = False
The system ships with a +cpu PyTorch build. It loads models, runs inference, produces output — entirely on CPU. No warning, no error. The only signal is nvidia-smi showing zero GPU memory usage.

Fix: Uninstall default PyTorch, install cu121 nightly build with SM_100 support.

```bash
pip3 uninstall torch torchvision torchaudio -y
pip3 install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu121 \
  --break-system-packages
```
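
Before touching vLLM, confirm the swap actually took. Something like this should print the nightly version string and True; False means the +cpu wheel is still active:

```bash
# "True" means the GPU build is live; "False" means everything will
# silently run on CPU again.
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```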

FAIL-002 — vLLM build exits with metadata-generation-failed
It fails instantly, before a single line of C++ compiles. Root cause: vLLM's pyproject.toml uses the newer SPDX license string format (PEP 639) plus a license-files key, which the setuptools version on the system does not support yet. Converting the license back to the legacy table form unblocks metadata generation.

Fix:

```bash
sed -i 's/license = "Apache-2.0"/license = {text = "Apache-2.0"}/g' pyproject.toml
sed -i '/license-files =/d' pyproject.toml
```
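
If you want to double-check before rebuilding, this should show the table form and no remaining license-files line:

```bash
# Expect: license = {text = "Apache-2.0"} and no license-files key.
grep -n "license" pyproject.toml
```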

FAIL-003 — OOM Killer terminates build mid-compilation
Unlimited parallel CUDA kernel compilation exhausts system RAM. The build disappears silently — no error in terminal, only visible in dmesg.

Fix:

```bash
MAX_JOBS=8 python3 setup.py build_ext --inplace
```
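
If a build vanishes again, you can confirm the OOM killer fired and size MAX_JOBS to your RAM. The ~2 GB-per-job figure below is a rough rule of thumb, not a measured number:

```bash
# Confirm the kernel killed the build rather than the build erroring out:
dmesg | grep -iE "out of memory|oom-kill"

# Rough sizing heuristic: assume ~2 GB of RAM per parallel nvcc job.
total_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
echo "Suggested MAX_JOBS: $((total_gb / 2))"
```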

FAIL-004 — 70B model causes silent OOM at weight loading
70B bfloat16 = ~132GB VRAM required. GB10 = 120GB available. OOM Killer fires at ~90% weight loading. No CUDA exception, no Python traceback — just silent process termination.

Fix: Switch to 32B (~64GB VRAM). That leaves ~56GB for KV cache, which is actually faster on long-context tasks than a memory-constrained 70B would be.
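
The back-of-envelope math (weights only, 2 bytes per bf16 parameter; KV cache and CUDA context overhead come on top):

```bash
# bf16 stores 2 bytes per parameter, so the weights alone come to:
echo "70B: $((70 * 2)) GB (~130 GiB) of weights -- over budget before any KV cache"
echo "32B: $((32 * 2)) GB (~60 GiB) of weights -- leaves ~56 GB of headroom"
```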


FAIL-005 — V1 engine silent unresponsive state on long CoT
The server starts cleanly and short prompts work fine, but prompts that trigger deep reasoning (10K+ token output) leave the server unresponsive after 10-15 minutes: the process stays alive, yet it emits no errors and returns no responses.

Fix:

```bash
export VLLM_USE_V1=0
```
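
To reproduce the hang (or confirm the V0 fallback holds), any request that elicits 10K+ output tokens works as a probe. The prompt below is just an example:

```bash
# If the V1 engine is affected, this call eventually stalls while the
# server process stays alive; on the V0 engine it completes.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/nvidia/.cache/models/deepseek-r1-32b",
    "messages": [{"role": "user", "content": "Walk through a full proof that sqrt(2) is irrational, questioning every step."}],
    "max_tokens": 16384
  }'
```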

FAIL-006 — Health check timeout loop
Setting --gpu-memory-utilization to 0.95 or higher leaves no headroom for the rest of the system. Gnome/Xorg memory spikes then cause the health check to miss its deadline, triggering model reload loops.

Fix: --gpu-memory-utilization 0.85
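
To watch the contention live while the server is under load:

```bash
# Sustained near-zero free memory plus Gnome/Xorg allocations showing up
# here is the signature of the reload loop described above.
watch -n 2 "nvidia-smi --query-gpu=memory.used,memory.total --format=csv"
```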


FAIL-007 — Linker error: ld: cannot find -lcuda
The standard lib64 path is insufficient on aarch64; Blackwell needs the sbsa-linux target library path set explicitly.

Fix: export these before any build:

```bash
export CUDA_HOME=/usr/local/cuda-13.0
export LD_LIBRARY_PATH=$CUDA_HOME/targets/sbsa-linux/lib:$CUDA_HOME/lib64:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
export PATH=$CUDA_HOME/bin:$PATH
```
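
With those exports in place, you can check that the linker will now resolve libcuda. The paths below are the usual sbsa-linux locations; adjust if your install differs:

```bash
# The runtime driver library registered with the dynamic linker:
ldconfig -p | grep libcuda
# The stub the toolkit ships for link time on sbsa-linux:
ls "$CUDA_HOME/targets/sbsa-linux/lib/stubs/libcuda.so"
```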

Final Working Launch Command

```bash
export VLLM_USE_V1=0

CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.openai.api_server \
  --model "/home/nvidia/.cache/models/deepseek-r1-32b" \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --port 8000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --enforce-eager
```
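
Once the server logs that it is up, a quick smoke test against the OpenAI-compatible API:

```bash
# Lists the served model; a clean JSON response means the stack is healthy.
curl -s http://localhost:8000/v1/models | python3 -m json.tool
```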

What I’m Building On Top of This

This stable inference layer is the foundation for ANF (Autonomous Native Forge) — a 4-agent self-healing software production pipeline running entirely on local hardware with no cloud dependency.

Full setup protocol and failure log: github.com/trgysvc/AutonomousNativeForge

Happy to answer questions about any of the failure modes above. If you’ve hit different issues on GB10, I’d like to know — especially around multi-GPU tensor parallelism which is the next thing I’m testing.

— Turgay Savacı


Hey, @turgaysavaci! Thanks for sharing! What do you use as interface to the model? Is this opencode or something else?


Thank you for the writeup. I will move this to GB10 Projects.


Hey! The interface is a custom-built agent layer in pure Node.js: no frameworks, no OpenCode, nothing external. Each agent communicates with the model via a direct HTTP request to vLLM’s OpenAI-compatible endpoint (/v1/chat/completions on localhost:8000). No SDK, no client library, just Node.js’s native http module: const req = http.request({ hostname: 'localhost', port: 8000, path: '/v1/chat/completions', method: 'POST' }, …)

That’s the entire “interface”. The full architecture is in the repo linked above.

Thank you for the writeup and for moving this to GB10 Projects — really appreciated.
