I just managed to get NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 running locally with a from-source build of the vLLM inference engine on a single NVIDIA Jetson AGX Thor with 128 GB of unified memory.
1. Create a fresh venv
uv venv .vllm --python 3.12
source .vllm/bin/activate
2. Install PyTorch with CUDA 13 wheels
uv pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu130
3. Set the build environment for sm_110 (Thor)
export TORCH_CUDA_ARCH_LIST=11.0a
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"
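Before kicking off the builds, it's worth confirming the shell actually resolves the CUDA 13 toolchain; a quick sanity check, assuming the paths above:

```shell
# Re-export the toolchain variables and confirm ptxas resolves from CUDA 13.
export TORCH_CUDA_ARCH_LIST=11.0a
export CUDA_HOME=/usr/local/cuda-13
export PATH="${CUDA_HOME}/bin:$PATH"
command -v ptxas || echo "ptxas not on PATH - check the CUDA install"
```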
4. Build vLLM from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
MAX_JOBS=$(nproc) python3 setup.py bdist_wheel
uv pip install -r requirements/common.txt
uv pip install --no-deps dist/vllm*.whl
cd ..
5. Build FlashInfer from source
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
export FLASHINFER_CUDA_ARCH_LIST="11.0a"
uv pip install -r requirements.txt
MAX_JOBS=$(nproc) python -m flashinfer.aot
python3 -m build --no-isolation --wheel
uv pip install --no-deps dist/flashinfer*.whl
Build flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
uv pip install --no-deps dist/*.whl
Build flashinfer-jit-cache (first edit its pyproject.toml to drop the nvidia-nvshmem-cu12 dependency)
cd ../flashinfer-jit-cache
python -m build --no-isolation --wheel
uv pip install --no-deps dist/*.whl
cd ../..
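The pyproject.toml edit mentioned in the jit-cache step can be scripted; a sketch, assuming the dependency string in the upstream file matches exactly:

```shell
# Drop the nvidia-nvshmem-cu12 dependency line before building the wheel;
# keeps a .bak copy and leaves every other line untouched.
if [ -f pyproject.toml ]; then
  sed -i.bak '/nvidia-nvshmem-cu12/d' pyproject.toml
fi
```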
6. Clear old caches
rm -rf ~/.cache/flashinfer/
rm -rf ~/.cache/vllm/
7. Run with NVFP4. Without --max-cudagraph-capture-size, vLLM captures CUDA graphs for every batch size in its default list (1, 2, 4, 8, 16, … up to 512), so it is capped at 32 here to match --max-num-seqs.
vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.8 \
--max-cudagraph-capture-size 32 \
--max-num-seqs 32 \
--enable-chunked-prefill \
--host 0.0.0.0 \
--port 5000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
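Once the server reports startup complete, a quick smoke test against the OpenAI-compatible endpoint (assuming the host and port from the command above):

```shell
# Send one chat completion to the local server; prints the JSON response.
PAYLOAD='{"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4",
          "messages": [{"role": "user", "content": "Say hello."}],
          "max_tokens": 64}'
curl -s --max-time 120 http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable yet"
```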
Output:
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:37] Available routes are:
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=163557) INFO: Started server process [163557]
(APIServer pid=163557) INFO: Waiting for application startup.
(APIServer pid=163557) INFO: Application startup complete.
(APIServer pid=163557) INFO: 127.0.0.1:51430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=163557) INFO 03-14 21:52:22 [loggers.py:259] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 7.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
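Those periodic logger lines are easy to summarize; a small helper (hypothetical, just for eyeballing the steady-state decode rate from a saved log):

```shell
# Average the "Avg generation throughput" figures from vLLM engine log lines
# read on stdin; prints the mean in tokens/s.
avg_gen_tps() {
  grep -o 'generation throughput: [0-9.]*' |
    awk '{ sum += $3; n++ } END { if (n) printf "%.1f\n", sum / n }'
}
```

Piping the engine log through avg_gen_tps gives the sustained decode throughput over the run.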
