[GB10] vLLM + DeepSeek-R1-32B on Blackwell aarch64 — 4 more failure modes (v2 protocol)

Follow-up to my earlier post about getting vLLM stable on GB10. Did a few more full rebuilds while testing and hit 4 new failures that weren’t in the first writeup — all specific to aarch64 + CUDA 13.0.

Setup: GB10 | sbsa-linux | Python 3.12 | CUDA 13.0 | vLLM v0.7.1


  1. cu121 has no aarch64 wheels

The original protocol used the cu121 index. On aarch64 it just fails:

ERROR: Could not find a version that satisfies the requirement torch

Switch to cu130 — that’s the one with aarch64 builds:

sudo pip3 install --pre torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/nightly/cu130 \
  --break-system-packages
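
If you want to confirm which build pip actually installed, here is a minimal sketch. It reads pip's metadata instead of importing torch, because the import itself can still fail at this stage. It assumes the nightly wheels carry a +cu130 local version tag; the function names are mine, not part of any tool.

```shell
# Installed torch version according to pip metadata (no import needed).
torch_version() {
  pip3 show torch 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Return 0 if version string $1 ends in the +cu130 local tag (assumed naming).
is_cu130_build() {
  case "$1" in
    *+cu130) return 0 ;;
    *) return 1 ;;
  esac
}

if is_cu130_build "$(torch_version)"; then
  echo "torch is a cu130 build"
else
  echo "torch is missing or not a cu130 build" >&2
fi
```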


  2. ncclWaitSignal undefined symbol

After cu130 torch installs, importing it crashed:

ImportError: libtorch_cuda.so: undefined symbol: ncclWaitSignal

The NCCL from apt doesn’t export that symbol. pip pulls in nvidia-nccl-cu13, which does, but the dynamic loader doesn’t pick that copy up automatically.

Force it:

export LD_PRELOAD=/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2

It has to be set before the Python process starts. I put it in the systemd service’s Environment= so I don’t have to think about it.
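
A quick way to check that a given libnccl really exports the symbol before relying on the preload. This is only a sketch: the path is the one from this post, and exports_symbol is a helper name I made up.

```shell
# Return 0 if shared library $1 exports dynamic symbol $2.
exports_symbol() {
  nm -D --defined-only "$1" 2>/dev/null | grep -qw "$2"
}

PIP_NCCL=/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2

# Only preload the pip NCCL if it actually has the symbol torch wants.
if exports_symbol "$PIP_NCCL" ncclWaitSignal; then
  export LD_PRELOAD="$PIP_NCCL"
else
  echo "ncclWaitSignal not found in $PIP_NCCL" >&2
fi
```

Running the same function against the apt copy of libnccl.so.2 (wherever your distro puts it) should show the symbol is absent there.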


  3. numa.h missing during vLLM CPU extension build

fatal error: numa.h: No such file or directory

vLLM’s CPU extension needs libnuma-dev. Simple fix:

sudo apt-get install -y libnuma-dev
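
To catch this before a long build instead of partway through it, a small preflight check along these lines may help. The include directories listed are the usual Debian/Ubuntu ones and are an assumption; have_header is a hypothetical helper.

```shell
# Return 0 if header $1 is found in one of the usual system include dirs.
have_header() {
  local dir
  for dir in /usr/include /usr/local/include "/usr/include/$(uname -m)-linux-gnu"; do
    [ -f "$dir/$1" ] && return 0
  done
  return 1
}

have_header numa.h || echo "numa.h missing: sudo apt-get install -y libnuma-dev" >&2
```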


  4. ABI mismatch: MessageLogger undefined symbol

After a complete build, launching vLLM always failed:

ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib

Ran nm on both the binary and the torch library:

nm -D vllm/_C.abi3.so | grep MessageLogger

U _ZN3c1013MessageLoggerC1EPKciib ← (const char*, int, int, bool)

nm -D /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so | grep MessageLogger

T _ZN3c1013MessageLoggerC1ENS_14SourceLocationEib ← (SourceLocation, int, bool)

Different signatures. vLLM compiled against old headers, runtime found the newer cu130 torch.

Root cause: pip’s build isolation. pip install -e . creates an isolated build environment and downloads a separate, older torch into it based on the pyproject.toml constraints, so vLLM compiles against those old headers. At runtime it loads the newer cu130 torch, and the symbols no longer match.
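
To see the constraint an isolated build would follow, you can print the [build-system] table from pyproject.toml and compare it with what is installed. A rough sketch; build_requires is a helper name I made up, and it assumes the table is laid out in the usual way with the next table header starting on its own line.

```shell
# Print the [build-system] table of a pyproject.toml, up to the next table header.
build_requires() {
  sed -n '/^\[build-system\]/,/^\[/p' "${1:-pyproject.toml}"
}

# Usage, from the vLLM source checkout:
#   build_requires | grep -i torch      # the torch an isolated build would install
#   pip3 show torch | grep '^Version'   # the torch that is actually installed
```

If the two disagree, an isolated build will compile against a different torch than the one used at runtime.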

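The fix is --no-build-isolation, so the build uses the cu130 torch that is already installed. But sudo -E alone doesn’t carry LD_PRELOAD into pip’s subprocess chain (LD_* variables are typically stripped for setuid programs like sudo), so you need to inject it explicitly through env: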

sudo -E env \
  LD_PRELOAD="/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2" \
  LD_LIBRARY_PATH="/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib:/usr/local/cuda-13.0/targets/sbsa-linux/lib:/usr/local/cuda-13.0/lib64" \
  MAX_JOBS=8 \
  pip3 install -e . --no-deps --no-build-isolation --break-system-packages

Verify after installation:

nm -D vllm/_C.abi3.so | grep MessageLogger

The output must say SourceLocation, not EPKciib.
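
If this check goes into a script, something like the following could turn the nm output into a pass/fail answer. A sketch only: messagelogger_abi is a name I made up, and it matches the two mangled constructor names quoted above, nothing more.

```shell
# Classify `nm -D ... | grep MessageLogger` output read on stdin.
# Prints "stale" if the old (const char*, int, int, bool) constructor is
# still referenced, "ok" if only the SourceLocation one is, else "unknown".
messagelogger_abi() {
  local syms
  syms=$(cat)
  case "$syms" in
    *MessageLoggerC1EPKciib*) echo stale ;;
    *MessageLoggerC1ENS_14SourceLocationEib*) echo ok ;;
    *) echo unknown ;;
  esac
}

# Usage, from the vLLM source checkout:
#   nm -D vllm/_C.abi3.so | grep MessageLogger | messagelogger_abi
```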


Bonus: agent 404

If anything queries vLLM by model name, add:
--served-model-name deepseek-r1-32b

Without it vLLM serves under the full file path and anything using the short name gets 404.
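
To see which name the server is actually answering to, you can list the models from the OpenAI-compatible endpoint. A small sketch, assuming the server listens on localhost:8000 (vLLM's default port) and that python3 is on the PATH; model_ids is a hypothetical helper.

```shell
# List the model ids from an OpenAI-style /v1/models response on stdin.
model_ids() {
  python3 -c 'import json, sys
for m in json.load(sys.stdin)["data"]:
    print(m["id"])'
}

# Usage (host and port are assumptions, adjust to your setup):
#   curl -s http://localhost:8000/v1/models | model_ids
```

If the id printed is a file path rather than deepseek-r1-32b, the flag did not take effect.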


Full v2 protocol with automation script and all failure modes:
GitHub - trgysvc/AutonomousNativeForge → docs/BLACKWELL_SETUP_V2.md

The repo is for ANF — a 4-agent autonomous coding pipeline running on top of this inference stack. Setup docs work standalone if that’s all you need.

Anyone else seen the ABI mismatch? Wondering if it’s aarch64-specific or also shows up on x86_64 with cu130.