I just managed to get NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 running locally with a from-source build of the vLLM inference engine on a single NVIDIA Jetson AGX Thor with 128 GB of unified memory.
1. Create a fresh venv
uv venv .vllm --python 3.12
source .vllm/bin/activate
2. Install PyTorch with CUDA 13 wheels
uv pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu130
3. Set the build environment for sm_110 (Thor)
export TORCH_CUDA_ARCH_LIST=11.0a
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"
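Before kicking off the builds, it's worth confirming the shell actually resolves the CUDA 13 toolchain; a quick sanity check, assuming the paths above:

```shell
# Re-export the toolchain variables and confirm ptxas resolves from CUDA 13.
export TORCH_CUDA_ARCH_LIST=11.0a
export CUDA_HOME=/usr/local/cuda-13
export PATH="${CUDA_HOME}/bin:$PATH"
command -v ptxas || echo "ptxas not on PATH - check the CUDA install"
```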
4. Build vLLM from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
MAX_JOBS=$(nproc) python3 setup.py bdist_wheel
uv pip install -r requirements/common.txt
uv pip install --no-deps dist/vllm*.whl
cd ..
5. Build FlashInfer from source
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
export FLASHINFER_CUDA_ARCH_LIST="11.0a"
uv pip install -r requirements.txt
MAX_JOBS=$(nproc) python -m flashinfer.aot
python3 -m build --no-isolation --wheel
uv pip install --no-deps dist/flashinfer*.whl
Build flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
uv pip install --no-deps dist/*.whl
Build flashinfer-jit-cache (first edit its pyproject.toml to drop the nvidia-nvshmem-cu12 dependency)
cd ../flashinfer-jit-cache
python -m build --no-isolation --wheel
uv pip install --no-deps dist/*.whl
cd ../..
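The pyproject.toml edit mentioned in the jit-cache step can be scripted; a sketch, assuming the dependency string in the upstream file matches exactly:

```shell
# Drop the nvidia-nvshmem-cu12 dependency line before building the wheel;
# keeps a .bak copy and leaves every other line untouched.
if [ -f pyproject.toml ]; then
  sed -i.bak '/nvidia-nvshmem-cu12/d' pyproject.toml
fi
```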
6. Clear old caches
rm -rf ~/.cache/flashinfer/
rm -rf ~/.cache/vllm/
7. Run with NVFP4. Without --max-cudagraph-capture-size, vLLM captures CUDA graphs for every batch size in its default list (1, 2, 4, 8, 16, … up to 512), so it is capped at 32 here to match --max-num-seqs.
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.8 \
--max-cudagraph-capture-size 32 \
--max-num-seqs 32 \
--enable-chunked-prefill \
--host 0.0.0.0 \
--port 5000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
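Once the server reports startup complete, a quick smoke test against the OpenAI-compatible endpoint (assuming the host and port from the command above):

```shell
# Send one chat completion to the local server; prints the JSON response.
PAYLOAD='{"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
          "messages": [{"role": "user", "content": "Say hello."}],
          "max_tokens": 64}'
curl -s --max-time 120 http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable yet"
```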
Output:
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:37] Available routes are:
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=163557) INFO: Started server process [163557]
(APIServer pid=163557) INFO: Waiting for application startup.
(APIServer pid=163557) INFO: Application startup complete.
(APIServer pid=163557) INFO: 127.0.0.1:51430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=163557) INFO 03-14 21:52:22 [loggers.py:259] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 7.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
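Those periodic logger lines are easy to summarize; a small helper (hypothetical, just for eyeballing the steady-state decode rate from a saved log):

```shell
# Average the "Avg generation throughput" figures from vLLM engine log lines
# read on stdin; prints the mean in tokens/s.
avg_gen_tps() {
  grep -o 'generation throughput: [0-9.]*' |
    awk '{ sum += $3; n++ } END { if (n) printf "%.1f\n", sum / n }'
}
```

Piping the engine log through avg_gen_tps gives the sustained decode throughput over the run.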
