Running NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on the Nvidia Jetson Thor

I just managed to get NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 running locally, using a vLLM inference engine built from source, on a single NVIDIA Jetson AGX Thor with 128 GB of unified memory.

1. Create a fresh venv

uv venv .vllm --python 3.12
source .vllm/bin/activate

2. Install torch with CUDA 13

uv pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu130

3. Set environment for sm110

export TORCH_CUDA_ARCH_LIST=11.0a
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"

4. Build vllm from source

git clone https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
MAX_JOBS=$(nproc) python3 setup.py bdist_wheel
uv pip install -r requirements/common.txt
uv pip install --no-deps dist/vllm*.whl
cd ..

5. Build FlashInfer from source

git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
export FLASHINFER_CUDA_ARCH_LIST="11.0a"
uv pip install -r requirements.txt
MAX_JOBS=$(nproc) python -m flashinfer.aot
python3 -m build --no-isolation --wheel
uv pip install --no-deps dist/flashinfer*.whl

Build flashinfer-cubin

cd flashinfer-cubin
python -m build --no-isolation --wheel
uv pip install --no-deps dist/*.whl

Build flashinfer-jit-cache (edit pyproject.toml to remove nvidia-nvshmem-cu12 dep first)

cd ../flashinfer-jit-cache
python -m build --no-isolation --wheel
uv pip install --no-deps dist/*.whl
cd ../..

6. Clear old caches

rm -rf ~/.cache/flashinfer/
rm -rf ~/.cache/vllm/

7. Run with NVFP4. Without --max-cudagraph-capture-size, vLLM captures CUDA graphs for every batch size in its default list (1, 2, 4, 8, 16, … up to 512), so I cap it at 32:

vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
--async-scheduling \
--dtype auto \
--kv-cache-dtype fp8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 1 \
--data-parallel-size 1 \
--trust-remote-code \
--attention-backend TRITON_ATTN \
--gpu-memory-utilization 0.8 \
--max-cudagraph-capture-size 32 \
--max-num-seqs 32 \
--enable-chunked-prefill \
--host 0.0.0.0 \
--port 5000 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser-plugin "./super_v3_reasoning_parser.py" \
--reasoning-parser super_v3
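To illustrate the capping described above, here is a minimal sketch, assuming the default capture list is the powers of two observed in the step above (this models the observed behavior, not vLLM's actual internals):

```python
# Illustrative only: the effect of --max-cudagraph-capture-size on the set of
# batch sizes vLLM captures CUDA graphs for. The default list of powers of
# two up to 512 is taken from the behavior described above.
default_sizes = [2**i for i in range(10)]  # 1, 2, 4, ..., 512

def captured_sizes(max_capture_size):
    """Keep only batch sizes at or below the cap."""
    return [s for s in default_sizes if s <= max_capture_size]

print(captured_sizes(32))  # with --max-cudagraph-capture-size 32
print(len(default_sizes))  # graph count without the cap
```

Capping at 32 (matching --max-num-seqs 32) avoids spending startup time and memory capturing graphs for batch sizes the server will never run.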

Output:

(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:37] Available routes are:
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=163557) INFO 03-14 21:51:32 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=163557) INFO:     Started server process [163557]
(APIServer pid=163557) INFO:     Waiting for application startup.
(APIServer pid=163557) INFO:     Application startup complete.
(APIServer pid=163557) INFO:     127.0.0.1:51430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=163557) INFO 03-14 21:52:22 [loggers.py:259] Engine 000: Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 7.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:52:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:53:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:53:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:53:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:53:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:53:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:53:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 10.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=163557) INFO 03-14 21:54:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
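To track throughput over time from logs like these, the periodic stats lines can be parsed with a small helper (a convenience sketch; the line format is copied from the output above and may change between vLLM versions):

```python
import re

# Match vLLM's periodic stats line (format copied from the log output above).
PATTERN = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s, "
    r"Avg generation throughput: ([\d.]+) tokens/s"
)

def parse_throughput(line):
    """Return (prompt_tps, generation_tps), or None if the line doesn't match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))

line = ("(APIServer pid=163557) INFO 03-14 21:52:22 [loggers.py:259] Engine 000: "
        "Avg prompt throughput: 3.6 tokens/s, Avg generation throughput: 7.4 tokens/s, "
        "Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, "
        "Prefix cache hit rate: 0.0%")
print(parse_throughput(line))  # (3.6, 7.4)
```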

Thanks for sharing this and the detailed command.


using llama.cpp:

sudo apt update
sudo apt install libcurl4-openssl-dev libssl-dev

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="110" -DGGML_CUDA_FA_ALL_QUANTS=ON
make -j8

./bin/llama-server -hf ggml-org/Nemotron-3-Super-120B-GGUF
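llama-server exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the server's default host and port (localhost:8080; pass --port to change it):

```python
import json
from urllib import request

# Sketch of querying llama-server's OpenAI-compatible chat endpoint.
# Assumes the default localhost:8080; adjust the URL for your setup.
payload = {
    "messages": [{"role": "user", "content": "Hello from Jetson Thor"}],
    "max_tokens": 64,
}
req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=30) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
except Exception as e:
    # Connection refused, timeout, etc. if the server isn't up yet.
    print("request failed:", e)
```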


Something's not right: Qwen3.5 122B gives 20 tps @ 25k requests, with up to 130 tps at 10 requests with mtp-5, using the vllm package on GitHub.

I've followed the installation instructions in this topic. However, I ran into an error when I start the model:

python3 -m vllm.entrypoints.openai.api_server \
  --model ~/apps/models/Qwen3-30B-A3B-AWQ \
  --dtype auto \
  --tensor-parallel-size 1 \
  --max-model-len 28000 \
  --gpu-memory-utilization 0.76 \
  --served-model-name Qwen3-30B-A3B-AWQ

I got the error

can't find gpu-name sm_110a

I added a step to work around the Triton backend's bundled ptxas (which doesn't recognize sm_110a):


mv ~/.vllm/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas-blackwell \
   ~/.vllm/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas-blackwell.bak

ln -s /usr/local/cuda-13/bin/ptxas \
      ~/.vllm/lib/python3.12/site-packages/triton/backends/nvidia/bin/ptxas-blackwell

Then I restarted and hit this error:

(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     return self.runnable(*args, **kwargs)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/vllm/compilation/piecewise_backend.py", line 367, in __call__
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     return range_entry.runnable(*args)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/vllm/compilation/compiler_interface.py", line 452, in compiled_graph_wrapper
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     graph_output = inductor_compiled_graph(*args)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     return self._compiled_fn(*args)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 215, in <lambda>
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     return CacheCompiledArtifact(lambda *args: compiled_fn(list(args)), None)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]                                                ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     all_outs = call_func_at_runtime_with_args(
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     out = normalize_as_list(f(args))
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]                             ^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     return compiled_fn(runtime_args)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     return self.current_callable(inputs)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/home/x/.vllm/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     out = model(new_inputs)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]           ^^^^^^^^^^^^^^^^^
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]   File "/tmp/torchinductor_x/z4/cz4y7gm3hvt7ggf4t6kdu7uhq4hntplj4tt345wvjstwvvnbe3ye.py", line 1132, in call
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099]     extern_kernels.mm(buf3, reinterpret_tensor(arg10_1, (2048, 128), (1, 2048), 0), out=buf4)
(EngineCore pid=196310) ERROR 03-19 17:01:53 [core.py:1099] RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmEx( handle, opa, opb, m, n, k, alpha_ptr, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, beta_ptr, c, std::is_same_v<C_Dtype, float> ? CUDA_R_32F : CUDA_R_16F, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

My machine environment is JetPack 7.1
(I flashed it today and installed the packages below):

sudo apt update
sudo apt upgrade
sudo apt install nvidia-jetpack

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/13.0.1/local_installers/cuda-repo-ubuntu2404-13-0-local_13.0.1-580.82.07-1_arm64.deb
sudo dpkg -i cuda-repo-ubuntu2404-13-0-local_13.0.1-580.82.07-1_arm64.deb
sudo cp /var/cuda-repo-ubuntu2404-13-0-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install -y cuda-toolkit-13-0
wget https://developer.download.nvidia.com/compute/cusparselt/0.8.1/local_installers/cusparselt-local-repo-ubuntu2404-0.8.1_0.8.1-1_arm64.deb
sudo dpkg -i cusparselt-local-repo-ubuntu2404-0.8.1_0.8.1-1_arm64.deb
sudo cp /var/cusparselt-local-repo-ubuntu2404-0.8.1/cusparselt-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cusparselt-cuda-13
wget https://developer.download.nvidia.com/compute/nvpl/25.11/local_installers/nvpl-local-repo-ubuntu2404-25.11_1.0-1_arm64.deb
sudo dpkg -i nvpl-local-repo-ubuntu2404-25.11_1.0-1_arm64.deb
sudo cp /var/nvpl-local-repo-ubuntu2404-25.11/nvpl-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install nvpl
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt install libnccl2=2.28.3-1+cuda13.0 libnccl-dev=2.28.3-1+cuda13.0

I installed them today and then wanted to follow this installation guide. However, it didn't work for me.

The Python environment is as below:


Using Python 3.12.3 environment at: /home/x/.vllm
Package                                  Version
---------------------------------------- ------------------------------------------
aiohappyeyeballs                         2.6.1
aiohttp                                  3.13.3
aiosignal                                1.4.0
annotated-doc                            0.0.4
annotated-types                          0.7.0
anthropic                                0.86.0
anyio                                    4.12.1
apache-tvm-ffi                           0.1.9
astor                                    0.8.1
attrs                                    25.4.0
blake3                                   1.0.8
build                                    1.4.0
cachetools                               7.0.5
cbor2                                    5.8.0
certifi                                  2026.2.25
cffi                                     2.0.0
charset-normalizer                       3.4.6
click                                    8.3.1
cloudpickle                              3.1.2
cmake                                    4.2.3
compressed-tensors                       0.14.0.1
cryptography                             46.0.5
cuda-bindings                            13.0.3
cuda-pathfinder                          1.2.2
cuda-python                              13.0.3
cuda-tile                                1.2.0
depyf                                    0.20.0
dill                                     0.4.1
diskcache                                5.6.3
distro                                   1.9.0
dnspython                                2.8.0
docstring-parser                         0.17.0
einops                                   0.8.2
email-validator                          2.3.0
fastapi                                  0.135.1
fastapi-cli                              0.0.24
fastapi-cloud-cli                        0.15.0
fastar                                   0.8.0
filelock                                 3.20.0
flashinfer-cubin                         0.6.6
flashinfer-python                        0.6.6
frozenlist                               1.8.0
fsspec                                   2025.12.0
gguf                                     0.18.0
googleapis-common-protos                 1.73.0
grpcio                                   1.78.0
h11                                      0.16.0
hf-xet                                   1.4.2
httpcore                                 1.0.9
httptools                                0.7.1
httpx                                    0.28.1
httpx-sse                                0.4.3
huggingface-hub                          0.36.2
idna                                     3.11
ijson                                    3.5.0
importlib-metadata                       8.7.1
interegular                              0.3.3
jinja2                                   3.1.6
jiter                                    0.13.0
jmespath                                 1.1.0
jsonschema                               4.26.0
jsonschema-specifications                2025.9.1
lark                                     1.2.2
llguidance                               1.3.0
lm-format-enforcer                       0.11.3
loguru                                   0.7.3
markdown-it-py                           4.0.0
markupsafe                               3.0.2
mcp                                      1.26.0
mdurl                                    0.1.2
mistral-common                           1.10.0
model-hosting-container-standards        0.1.14
modelscope                               1.35.0
mpmath                                   1.3.0
msgspec                                  0.20.0
multidict                                6.7.1
networkx                                 3.6.1
ninja                                    1.13.0
numpy                                    2.3.5
nvidia-cublas                            13.1.0.3
nvidia-cuda-cupti                        13.0.85
nvidia-cuda-nvrtc                        13.0.88
nvidia-cuda-runtime                      13.0.96
nvidia-cudnn-cu13                        9.15.1.9
nvidia-cudnn-frontend                    1.20.0
nvidia-cufft                             12.0.0.61
nvidia-cufile                            1.15.1.6
nvidia-curand                            10.4.0.35
nvidia-cusolver                          12.0.4.66
nvidia-cusparse                          12.6.3.3
nvidia-cusparselt-cu13                   0.8.0
nvidia-cutlass-dsl                       4.4.2
nvidia-cutlass-dsl-libs-base             4.4.2
nvidia-ml-py                             13.590.48
nvidia-nccl-cu13                         2.28.9
nvidia-nvjitlink                         13.0.88
nvidia-nvshmem-cu13                      3.4.5
nvidia-nvtx                              13.0.85
openai                                   2.29.0
openai-harmony                           0.0.8
opencv-python-headless                   4.13.0.92
opentelemetry-api                        1.40.0
opentelemetry-exporter-otlp              1.40.0
opentelemetry-exporter-otlp-proto-common 1.40.0
opentelemetry-exporter-otlp-proto-grpc   1.40.0
opentelemetry-exporter-otlp-proto-http   1.40.0
opentelemetry-proto                      1.40.0
opentelemetry-sdk                        1.40.0
opentelemetry-semantic-conventions       0.61b0
opentelemetry-semantic-conventions-ai    0.4.15
outlines-core                            0.2.11
packaging                                26.0
partial-json-parser                      0.2.1.1.post7
pillow                                   12.0.0
prometheus-client                        0.24.1
prometheus-fastapi-instrumentator        7.1.0
propcache                                0.4.1
protobuf                                 6.33.6
psutil                                   7.2.2
py-cpuinfo                               9.0.0
pybase64                                 1.4.3
pycountry                                26.2.16
pycparser                                3.0
pydantic                                 2.12.5
pydantic-core                            2.41.5
pydantic-extra-types                     2.11.1
pydantic-settings                        2.13.1
pygments                                 2.19.2
pyjwt                                    2.12.1
pyproject-hooks                          1.2.0
python-dotenv                            1.2.2
python-json-logger                       4.0.0
python-multipart                         0.0.22
pyyaml                                   6.0.3
pyzmq                                    27.1.0
referencing                              0.37.0
regex                                    2026.2.28
requests                                 2.32.5
rich                                     14.3.3
rich-toolkit                             0.19.7
rignore                                  0.7.6
rpds-py                                  0.30.0
safetensors                              0.7.0
sentencepiece                            0.2.1
sentry-sdk                               2.55.0
setproctitle                             1.3.7
setuptools                               80.10.2
setuptools-scm                           9.2.2
shellingham                              1.5.4
six                                      1.17.0
sniffio                                  1.3.1
sse-starlette                            3.3.3
starlette                                0.52.1
supervisor                               4.3.0
sympy                                    1.14.0
tabulate                                 0.10.0
tiktoken                                 0.12.0
tokenizers                               0.22.2
torch                                    2.10.0+cu130
torchaudio                               2.10.0+cu130
torchvision                              0.25.0+cu130
tqdm                                     4.67.3
transformers                             4.57.6
triton                                   3.6.0
typer                                    0.24.1
typing-extensions                        4.15.0
typing-inspection                        0.4.2
urllib3                                  2.6.3
uvicorn                                  0.42.0
uvloop                                   0.22.1
vllm                                     0.17.2rc1.dev96+ge3126cd10.d20260319.cu130
watchfiles                               1.1.1
websockets                               16.0
wheel                                    0.46.3
xgrammar                                 0.1.32
yarl                                     1.23.0
zipp                                     3.23.0

Hi,

JetPack already includes CUDA, so you don’t need to manually install it.
Could you try to test a CUDA sample to see if CUDA works well in your environment first?

Thanks.

I tested CUDA with the official deviceQuery sample on Jetson Thor.
It detects NVIDIA Thor, shows CUDA Driver/Runtime 13.0 / 13.0, CUDA capability 11.0, and returns Result = PASS.
So the base CUDA environment appears to be working.
My remaining issue seems to be in the PyTorch / Triton / vLLM stack, especially around sm_110a / ptxas-blackwell support.
Here is the procedure below:

git clone https://github.com/NVIDIA/cuda-samples.git
cd ~/cuda-samples
mkdir build
cd build
cmake ..
make -j$(nproc)
cd ~/cuda-samples/build/Samples/1_Utilities/deviceQuery
./deviceQuery

The output is below:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Thor"
  CUDA Driver Version / Runtime Version          13.0 / 13.0
  CUDA Capability Major/Minor version number:    11.0
  Total amount of global memory:                 125772 MBytes (131881181184 bytes)
  (020) Multiprocessors, (128) CUDA Cores/MP:    2560 CUDA Cores
  GPU Max Clock rate:                            1049 MHz (1.05 GHz)
  Memory Clock rate:                             666 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 33554432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        233472 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 13.0, CUDA Runtime Version = 13.0, NumDevs = 1
Result = PASS
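As a sanity check, the headline numbers in that output are self-consistent (values copied from the deviceQuery output above):

```python
# Cross-check the deviceQuery figures above.
sms = 20                    # "(020) Multiprocessors"
cores_per_sm = 128          # "(128) CUDA Cores/MP"
total_bytes = 131881181184  # total global memory in bytes

cuda_cores = sms * cores_per_sm
mbytes = round(total_bytes / 2**20)

print(cuda_cores)  # 2560, matching "2560 CUDA Cores"
print(mbytes)      # 125772, matching "125772 MBytes"
```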

Then I checked the Python environment:

python3 - <<'PY'
import torch
print("torch:", torch.__version__)
print("torch cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("capability:", torch.cuda.get_device_capability(0))
PY

python3 - <<'PY'
try:
    import triton
    print("triton:", triton.__version__)
except Exception as e:
    print("triton import failed:", e)
PY

python3 - <<'PY'
try:
    import vllm
    print("vllm:", vllm.__version__)
except Exception as e:
    print("vllm import failed:", e)
PY

which ptxas
ptxas --version
torch: 2.10.0+cu130
torch cuda: 13.0
cuda available: True
device: NVIDIA Thor
capability: (11, 0)
triton: 3.6.0
vllm: 0.17.2rc1.dev96+ge3126cd10.d20260319
/usr/local/cuda-13.0/bin/ptxas
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:53:56_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

By the way, I have a question about the correct installation workflow on Jetson Thor.

If I flash the device using SDK Manager and install JetPack 7.1, which already includes CUDA, cuDNN, NCCL, and other dependencies (I don’t know whether these were included and whether I should install them again), should I directly follow the vLLM installation guide from there?

In other words, is it unnecessary (and possibly incorrect) to manually install the CUDA toolkit or other low-level libraries again?

I want to confirm whether the recommended approach is to rely entirely on the JetPack-provided environment and only install Python-level dependencies for vLLM.

Thanks!

Hi,

Is a container an option for you?
If yes, could you try if the model can work in our nvcr.io/nvidia/vllm:26.02-py3 container?

Thanks.

Thanks. I've tried containers from the Docker images below:

ghcr.io/nvidia-ai-iot/vllm
nvcr.io/nvidia/vllm:26.02-py3

They both worked well. But I want to know how to build and use the newest vLLM environment on my Thor: how do I install it locally, and what's wrong with my current environment?

Hi,

You can follow the instructions below:

Thanks.