Hey everyone,
I spent 4 days getting vLLM stable on a Blackwell GB10 (120GB VRAM) running DeepSeek-R1-Distill-Qwen-32B. Most of what broke is not documented anywhere — so I’m sharing the full failure log here and in the repo linked below.
Hardware: NVIDIA Blackwell GB10 | 120GB VRAM | aarch64 (sbsa-linux)
Model: DeepSeek-R1-Distill-Qwen-32B (bfloat16)
Engine: vLLM compiled from source
OS: Linux aarch64
What Broke (In Order)
FAIL-001 — PyTorch reports cuda.is_available() = False
The system ships with a +cpu PyTorch build. It loads models, runs inference, produces output — entirely on CPU. No warning, no error. The only signal is nvidia-smi showing zero GPU memory usage.
Fix: Uninstall default PyTorch, install cu121 nightly build with SM_100 support.
```bash
pip3 uninstall torch torchvision torchaudio -y
pip3 install --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu121 \
    --break-system-packages
```
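A quick sanity check after the reinstall (a minimal sketch; it assumes the nightly wheel landed in the same Python environment the server will use):
```bash
# Should print a CUDA build string and True; a +cpu build or False means
# you are still on the CPU wheel.
python3 -c "import torch; print(torch.__version__, torch.version.cuda); print('cuda:', torch.cuda.is_available())"
```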
FAIL-002 — vLLM build exits with metadata-generation-failed
Happens instantly, before a single line of C++ compiles. Root cause: vLLM's pyproject.toml declares the license as a bare SPDX string (license = "Apache-2.0", the newer PEP 639 style) plus a license-files field, and the setuptools version shipped on the system can't parse either one.
Fix:
```bash
sed -i 's/license = "Apache-2.0"/license = {text = "Apache-2.0"}/g' pyproject.toml
sed -i '/license-files =/d' pyproject.toml
```
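If you would rather not patch vLLM's pyproject.toml, upgrading the build tooling should also get it to accept the newer license syntax (a sketch, not the route I took above):
```bash
pip3 install --upgrade pip setuptools wheel --break-system-packages
```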
FAIL-003 — OOM Killer terminates build mid-compilation
Unlimited parallel CUDA kernel compilation exhausts system RAM. The build disappears silently — no error in terminal, only visible in dmesg.
Fix:
```bash
MAX_JOBS=8 python3 setup.py build_ext --inplace
```
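To confirm it really was the OOM killer and not a compiler crash (the grep pattern is a guess at common kernel log wording; adjust for your kernel):
```bash
sudo dmesg -T | grep -iE "out of memory|oom-kill|killed process"
```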
FAIL-004 — 70B model causes silent OOM at weight loading
70B bfloat16 = ~132GB VRAM required. GB10 = 120GB available. OOM Killer fires at ~90% weight loading. No CUDA exception, no Python traceback — just silent process termination.
Fix: Switch to the 32B model (~64GB of weights). That leaves ~56GB for KV cache — actually faster on long-context tasks than a memory-constrained 70B would be.
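The back-of-envelope arithmetic (weights only, 2 bytes per parameter for bfloat16; KV cache and activations come on top, and exact parameter counts vary a bit by checkpoint):
```bash
# Rough weight footprint: parameter count x 2 bytes, shown in GiB
python3 -c "
for name, params in [('70B', 70e9), ('32B', 32e9)]:
    print(f'{name}: ~{params * 2 / 2**30:.0f} GiB of weights')
"
```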
FAIL-005 — V1 engine silent unresponsive state on long CoT
Server starts cleanly, short prompts work fine. Prompts triggering deep reasoning (10K+ token output) cause the server to become unresponsive after 10-15 minutes. Process alive, no errors, no responses.
Fix:
```bash
export VLLM_USE_V1=0
```
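A cheap way to watch for that unresponsive state from the outside (the port matches the launch command below; the 5-second timeout and 30-second poll interval are arbitrary choices):
```bash
# Poll the server; a timeout here while the process is still alive is the
# signature of the silent-unresponsive state described above.
while true; do
    if curl -s -m 5 http://localhost:8000/health > /dev/null; then
        echo "$(date '+%H:%M:%S') responsive"
    else
        echo "$(date '+%H:%M:%S') NO RESPONSE"
    fi
    sleep 30
done
```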
FAIL-006 — Health check timeout loop
--gpu-memory-utilization at 0.95+ leaves no memory headroom for anything else running on the GPU. A Gnome/Xorg allocation spike is enough to blow past the limit, miss a health-check deadline, and trigger a model reload loop.
Fix: --gpu-memory-utilization 0.85
FAIL-007 — Linker error: ld: cannot find -lcuda
The standard lib64 path is not enough on aarch64: the sbsa-linux toolkit keeps its CUDA libraries under targets/sbsa-linux/lib, and that path has to be exported explicitly.
Fix — set these before any build:
```bash
export CUDA_HOME=/usr/local/cuda-13.0
export LD_LIBRARY_PATH=$CUDA_HOME/targets/sbsa-linux/lib:$CUDA_HOME/lib64:/usr/lib/aarch64-linux-gnu:$LD_LIBRARY_PATH
export PATH=$CUDA_HOME/bin:$PATH
```
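A quick way to confirm the linker will now actually find something (paths assumed from the exports above; in the toolkit layout the link-time libcuda.so stub normally sits under a stubs/ subdirectory):
```bash
ls "$CUDA_HOME/targets/sbsa-linux/lib" | head
ls "$CUDA_HOME/targets/sbsa-linux/lib/stubs" 2>/dev/null | grep -i libcuda
```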
Final Working Launch Command
```bash
export VLLM_USE_V1=0

CUDA_LAUNCH_BLOCKING=1 python3 -m vllm.entrypoints.openai.api_server \
    --model "/home/nvidia/.cache/models/deepseek-r1-32b" \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --dtype bfloat16 \
    --port 8000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --enforce-eager
```
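Once it's up, a quick smoke test against the OpenAI-compatible endpoint (the model field has to match the path the server was started with, since that's the name vLLM registers by default):
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "/home/nvidia/.cache/models/deepseek-r1-32b",
          "messages": [{"role": "user", "content": "Reply with one short sentence."}],
          "max_tokens": 64
        }'
```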
What I’m Building On Top of This
This stable inference layer is the foundation for ANF (Autonomous Native Forge) — a 4-agent self-healing software production pipeline running entirely on local hardware with no cloud dependency.
Full setup protocol and failure log: github.com/trgysvc/AutonomousNativeForge
Happy to answer questions about any of the failure modes above. If you’ve hit different issues on GB10, I’d like to know — especially around multi-GPU tensor parallelism which is the next thing I’m testing.
— Turgay Savacı