Announcing new vLLM container & 3.5X increase in Gen AI performance just 5 weeks after Jetson AGX Thor launch

A key advantage of the NVIDIA software stack is its commitment to continuous improvement. Frequent software optimizations mean that existing models are consistently accelerated—delivering better performance with simple software updates. We’ve demonstrated this with Jetson platforms like Orin and Xavier, and are now showcasing these gains on Jetson AGX Thor.

Just a few weeks after launch, we’ve boosted Gen AI performance on Jetson AGX Thor by up to 3.5X over the initial results we showcased at launch on models like Llama and DeepSeek. The newly released vLLM container enables these improvements thanks to FlashInfer support, xFormers integration, and other optimizations. Benchmarking the same Llama and DeepSeek models with the same quantization on the same platform, we now get up to 3.5X more tokens/sec, and developers can expect similar improvements on other models in future releases.

If you’re interested in benchmarking your own models using VLLM, check out the Jetson AI Lab Benchmarking tutorial.
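When comparing runs yourself, tokens/sec can be derived from the `usage` block that the OpenAI-compatible endpoint returns together with the request’s wall-clock time; a minimal sketch (the sample `usage` dict below is hypothetical):

```python
def tokens_per_second(usage: dict, elapsed_s: float) -> float:
    """Generation throughput from an OpenAI-style `usage` block."""
    return usage["completion_tokens"] / elapsed_s

# Hypothetical usage block from a /v1/chat/completions response:
usage = {"prompt_tokens": 12, "completion_tokens": 256, "total_tokens": 268}
print(tokens_per_second(usage, elapsed_s=4.0))  # → 64.0 tokens/sec
```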

VLLM Container at Launch (VLLM was part of the Triton container): nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3

Latest VLLM Container: Read below

NVIDIA Official VLLM and SGLang Containers

At launch, vLLM was packaged inside the Triton container. Now, vLLM and SGLang have dedicated containers, with regular monthly releases through NVIDIA GPU Cloud (NGC) starting September 2025.

Support for Nemotron Models

NVIDIA’s Nemotron™ family of multimodal models, designed to deliver state-of-the-art reasoning for AI agents, is fully supported in the new vLLM containers on Jetson AGX Thor.

Examples of Nemotron models supported:

NVFP4 Support

NVFP4 is a novel 4-bit floating point format introduced with the NVIDIA Blackwell GPU architecture, expanding flexible, low-bit micro floating-point options for developers. The new containers also provide optimized NVFP4 implementations for models like Llama, including these checkpoints:

Benchmarking Tutorial

For detailed benchmarking steps, visit the Jetson AI Lab Benchmarking tutorial.

Enjoy next-gen Gen AI performance on Jetson AGX Thor with every software update.


Thank you. It looks promising. I was able to successfully run
python /opt/vllm/vllm-src/examples/offline_inference/basic/basic.py

Could you pass on a request to the build farm for the NGC images? Maybe near the end of the Dockerfile add RUN pip uninstall pynvml, because:

>>> import vllm
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
  
>>> import torch
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]

Or, if you don’t want to uninstall it, edit:

/usr/local/lib/python3.12/dist-packages/_pynvml_redirector.py 
and remove printing of PYNVML_MSG and PYNVML_UTILS_MSG
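If rebuilding the image or editing the redirector isn’t practical, a per-process workaround could be to silence just that one warning before torch/vllm are imported; a sketch:

```python
import warnings

# Hide only the deprecated-pynvml FutureWarning; other warnings still show.
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message=r".*pynvml package is deprecated.*",
)

# Import torch/vllm *after* installing the filter, e.g.:
# import torch
# import vllm
```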

On Thor if you encounter an error like:

CUDA error (/opt/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:160): no kernel image is available for execution on the device.
NotImplementedError: VLLM_USE_V1=1 is not supported with VLLM_ATTENTION_BACKEND=XFORMERS

  File "/usr/local/lib/python3.12/dist-packages/vllm/attention/selector.py", line 200, in _cached_get_attn_backend
    raise ValueError(
ValueError: Invalid attention backend. Valid backends are: ['FLASH_ATTN', 'FLASH_ATTN_VLLM_V1', 'TRITON_ATTN_VLLM_V1', 'XFORMERS', 'ROCM_FLASH', 'ROCM_AITER_MLA', 'ROCM_AITER_MLA_VLLM_V1', 'ROCM_AITER_FA', 'TORCH_SDPA', 'FLASHINFER', 'FLASHINFER_VLLM_V1', 'TRITON_MLA', 'TRITON_MLA_VLLM_V1', 'FLASHMLA_VLLM_V1', 'FLASHMLA', 'CUTLASS_MLA', 'PALLAS', 'PALLAS_VLLM_V1', 'IPEX', 'DUAL_CHUNK_FLASH_ATTN', 'DIFFERENTIAL_FLASH_ATTN', 'NO_ATTENTION', 'FLEX_ATTENTION', 'TREE_ATTN', 'XFORMERS_VLLM_V1']
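Before rebuilding anything, a lighter workaround may be to force an attention backend that already ships Thor kernels; the NVIDIA reply later in this thread launches with FLASHINFER the same way:

```shell
# Force vLLM's attention backend via environment variable (FLASHINFER is
# one of the valid backends listed in the error above).
export VLLM_ATTENTION_BACKEND=FLASHINFER
echo "$VLLM_ATTENTION_BACKEND"

# then launch as usual, e.g.:
# vllm serve nvidia/Llama-3.1-8B-Instruct-FP4 --port 8000
```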

You could try the following.

# On Thor
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
git submodule sync
git submodule update --init --recursive

mkdir -p "$HOME/.cache"/{huggingface,vllm,flashinfer}

# Add swap so the build doesn't run out of memory. 16G here; I made mine 32G since I've got a 4tb nvme drive.
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
sudo swapon -a
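It may be worth confirming the swap actually came up before the heavy build steps:

```shell
# List active swap and check the kernel's view of total swap space.
swapon --show || true
grep SwapTotal /proc/meminfo
```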

# Synchronize cached writes to persistent storage, then drop the page cache.
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Start the base docker VLLM image
docker run -it --rm --net=host --name vllm1 --runtime nvidia --privileged \
  --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=4g \
  -v $HOME/.cache:/root/.cache \
  -v $PWD:/workspace \
  --workdir /workspace \
  -e HF_TOKEN \
  nvcr.io/nvidia/vllm:25.09-py3 bash

Now, in the vLLM docker container:

pushd /usr/local/lib/python3.12/dist-packages
cp -pr xformers xformers_backup
popd 

pip install -U pip wheel ninja packaging 
pip uninstall -y flash-attn xformers

# Build FlashAttention from source for Thor (sm_110) to compile the
# flash_attn_2_cuda.cpython-312-aarch64-linux-gnu.so engine.

export CUDA_HOME=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="11.0+PTX"
export FLASH_ATTN_CUDA_ARCHS="110"
export FLASH_ATTENTION_FORCE_BUILD=TRUE
# Keep MAX_JOBS at 8 or your Thor may lock up.
export MAX_JOBS=8
export USE_NINJA=1
export BUILD_TARGET="cuda"

python3 setup.py install

# Then create a wheel to save.
MAX_JOBS=8 python3 -m pip wheel . -v -w dist


pushd /usr/local/lib/python3.12/dist-packages
mv xformers_backup xformers
popd
# Keep this container running.

# On Thor, in a second terminal save the new image

docker commit vllm1 vllm:flashattn

# Back in the vllm1 container, exit it.

exit

Now start VLLM

# Flush cached writes to disk, then drop the page cache.
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# Choose your desired --model and substitute below. Here we use
# nvidia/Llama-3.1-8B-Instruct-FP4 to use less memory and to test an FP4 model.

# A. For most use cases.

docker run --name vllm --rm -it --network host \
  --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=4g \
  -e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -v "$HOME/.cache:/root/.cache" \
  vllm:flashattn \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Llama-3.1-8B-Instruct-FP4 \
    --download-dir /root/.cache/huggingface \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 512 \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.30 \
    --kv-cache-dtype=auto \
    --enforce-eager \
    --chat-template-content-format string


# and/or B. Enable CUDA graphs (this variant omits --enforce-eager). This will do a little compilation.

docker run --name vllm --rm -it --network host \
  --runtime=nvidia --gpus all --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=4g \
  -e VLLM_USE_V1=1 -e VLLM_WORKER_MULTIPROC=0 \
  -e HF_HOME=/root/.cache/huggingface \
  -v "$HOME/.cache:/root/.cache" \
  vllm:flashattn \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/Llama-3.1-8B-Instruct-FP4 \
    --download-dir /root/.cache/huggingface/hub \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 512 \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.30 \
    --kv-cache-dtype=auto


# The following is one way to interact with vLLM. Change the question if you want.

curl -X 'POST' 'http://127.0.0.1:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
"model": "nvidia/Llama-3.1-8B-Instruct-FP4",
"messages": [{"role":"user", "content": "What are Chihuahuas famous for?"}]
}' | jq
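The same request can also be made from Python; the helper below just shows where the reply lives in the response JSON (the sample response is abbreviated and hypothetical, and the commented-out requests call assumes the server above is running):

```python
def extract_reply(completion: dict) -> str:
    """Pull the assistant's text out of a /v1/chat/completions response."""
    return completion["choices"][0]["message"]["content"]

# Hypothetical, abbreviated response body for illustration:
sample = {
    "model": "nvidia/Llama-3.1-8B-Instruct-FP4",
    "choices": [{"index": 0,
                 "message": {"role": "assistant",
                             "content": "Chihuahuas are famous for their tiny size..."}}],
}
print(extract_reply(sample))

# To call the live server instead (requires `pip install requests`):
# import requests
# r = requests.post("http://127.0.0.1:8000/v1/chat/completions",
#                   json={"model": "nvidia/Llama-3.1-8B-Instruct-FP4",
#                         "messages": [{"role": "user",
#                                       "content": "What are Chihuahuas famous for?"}]})
# print(extract_reply(r.json()))
```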

When done with vLLM, Ctrl-C will shut it down:

(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.

Thanks @whitesscott. Just trying to understand: we can just launch the vLLM container on Thor and use vllm serve commands to start serving LLMs and VLMs.

The pynvml warning did not impact the serving.

So I’m not sure why we have to do the above steps; the vLLM container should just work on Thor.


Hi, @whitesscott

Thanks for the testing and feedback.

Just want to clarify that the vLLM container can work directly without extra steps.
For example, the nvidia/Llama-3.1-8B-Instruct-FP4 model shared in your testing.
It can be launched with the following command:

$ sudo docker run --rm -it --network host --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia --name=vllm nvcr.io/nvidia/vllm:25.09-py3
# VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve "nvidia/Llama-3.1-8B-Instruct-FP4" \
--port "8000" \
--host "0.0.0.0" \
--trust_remote_code \
--swap-space 16 \
--max-seq-len 32000 \
--max-model-len 32000 \
--tensor-parallel-size 1 \
--max-num-seqs 1024 \
--gpu-memory-utilization 0.8
...
(APIServer pid=176) INFO:     Started server process [176]
(APIServer pid=176) INFO:     Waiting for application startup.
(APIServer pid=176) INFO:     Application startup complete.

Thanks.


Thanks for listing some models to run on the Thor in the new vLLM container. I ran Llama-3.3 70B Instruct FP4 but when I tried the vision model you listed (Llama-3.1 Nemotron Nano VL 8B V1) I got several errors about missing python libraries. The first was timm, which is added with “pip install timm” but the second “open_clip” wouldn’t install:

ERROR: Could not find a version that satisfies the requirement open_clip (from versions: none)

ERROR: No matching distribution found for open_clip

So I would say this model doesn’t work on Thor.

I also note that every time I switch models I have to go into jtop and clear the memory cache manually. Am I missing something?

open_clip is open source (current version as of 9/21/25, with 12.7k stars on GitHub); the PyPI package is named open-clip-torch:

pip install open-clip-torch

This is a good way to flush cached writes and then clear the page cache:

sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

If you don’t have sync, you can install with

sudo apt install coreutils


When I try openai/gpt-oss-120b, the vLLM server launches, but when I call the “chat/completions” API, it keeps outputting “NoneNoneNone…”.

Hi,

There is a known Harmony encoding error when running gpt-oss with vLLM.
Have you applied the workaround that was mentioned in the comment below:

Thanks.

Yes, I did (same for 20b; it is OK with other models like Qwen, tested on Jetson Thor).

The same solution works for Jetson Orin with locally deployed vLLM.

Here is the output with "stream": true:

data: {"id":"chatcmpl-5b74052637d543b1b4300d4a74a6755c","object":"chat.completion.chunk","created":1759936479,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-5b74052637d543b1b4300d4a74a6755c","object":"chat.completion.chunk","created":1759936479,"model":"openai/gpt-oss-120b","choices":[{"index":0,"delta":{"reasoning_content":""},"logprobs":null,"finish_reason":null}]}
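For anyone scripting against the stream, a sketch of extracting the delta text from chunks like those (the sample line here is simplified):

```python
import json

def parse_sse_chunk(line: str) -> str:
    """Extract the delta text from one `data: {...}` streaming line; '' if none."""
    if not line.startswith("data: ") or line.strip() == "data: [DONE]":
        return ""
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content") or ""

line = ('data: {"id":"chatcmpl-1","object":"chat.completion.chunk",'
        '"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}')
print(parse_sse_chunk(line))  # → Hello
```

Note the `or ""`: chunks that carry only reasoning_content have no content field (or it is null), so naively printing delta.get("content") for every chunk may be exactly what produces the “NoneNoneNone” output described above.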

I have Llama-3.3 70B Instruct FP4 running on Thor, noobie here. About 100 GB of RAM occupied, and it serves through Open WebUI smoothly on my Thor. The system randomly froze and the desktop lost response after a few hours, the same symptom as with Ollama + gpt-oss:70b, although there it freezes in a few minutes, not hours as with vLLM. Looks like a driver issue.

FP4 looks promising. I want to get a DGX Spark cluster to run <100B models locally if the “driver issue” is resolved, and even more to hook up a dozen of them to quantize FP16→FP4 locally if possible.

Any news about SGLang please update us!

And one more thing where I need help: I really want a test run of this FP4 version of GLM from HF:

Thanks for trying out the container and the model! For Llama-3.1-Nemotron-Nano-VL-8B-V1, you need to install timm and open_clip. You can do both at once with pip install timm open-clip-torch and then load the model.
Example: vllm serve "nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1" --trust_remote_code --tensor-parallel-size 1 --gpu-memory-utilization 0.8
Please note that sometimes this particular model won’t load on the first try; if that happens, please clear vLLM’s cache and try again!


Has anyone noticed that vLLM in this image (nvcr.io/nvidia/vllm:25.09-py3) works the first time after boot, but fails after restarting? If I stop the server with Ctrl-C and start it again with the same command, it usually fails with:

(EngineCore_0 pid=748) ERROR 10-11 02:17:54 [core.py:700] 
ValueError: Free memory on device (34.7/122.82 GiB) on startup is less than desired GPU memory utilization (0.7, 85.98 GiB). 
Decrease GPU memory utilization or reduce GPU memory used by other processes.

It seems this vLLM version doesn’t release allocated VRAM/memory on exit.
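A rough sketch of the arithmetic behind that error, assuming vLLM simply requires utilization × total GiB to be free at startup:

```python
def enough_free_memory(free_gib: float, total_gib: float, utilization: float) -> bool:
    """Rough mirror of vLLM's startup check: needs utilization * total GiB free."""
    return free_gib >= utilization * total_gib

# Numbers from the error above: 34.7 GiB free of 122.82 GiB, utilization 0.7.
print(enough_free_memory(34.7, 122.82, 0.7))   # → False (needs ~85.97 GiB)
print(enough_free_memory(34.7, 122.82, 0.25))  # → True  (needs ~30.7 GiB)
```

So after an unclean exit, either reclaim the leaked memory or lower --gpu-memory-utilization until the check passes.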

Without a container, running on the host, I see a similar issue. I believe it’s a CUDA driver issue.


This is a good way to flush cached writes and then clear the page cache:

sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

If you don’t have sync, you can install with:

sudo apt install coreutils

This is the line in the docker run command that I think helps too:

  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size=16g \

I haven’t worked with the vllm command enough yet, as I’ve used Python to launch the vLLM application with these settings to reduce memory use. I believe you can add these settings (or something close) as arguments to the vllm command. I’ve reduced --gpu-memory-utilization anywhere from 1.00 down to 0.20 and vLLM was still functional.

    --max-model-len 512 \
    --max-num-seqs 2 \
    --gpu-memory-utilization 0.30 \

I also have the memory release issue using the vLLM container to run and test Nemotron Nano 9B v2 FP8 on Jetson OS (r38.2-08-22). It does not completely free the memory used by CUDA.

Thanks to whitesscott and shahizat (in Nvidia Jetson Thor memory release issue), I can free memory without rebooting the system. I wonder if this is a common issue or just a system feature?


Hm. My attempt to run a model using vLLM failed with:

(EngineCore_0 pid=215) INFO 10-12 16:17:37 [backends.py:559] Dynamo bytecode transform time: 17.74 s
/tmp/tmpua6uddqm/cuda_utils.c:1:10: fatal error: cuda.h: No such file or directory
    1 | #include "cuda.h"
      |          ^~~~~~~~
compilation terminated.

I’m not sure what to do about that.

whitesscott kindly sent me a message that this probably is in PyTorch, along with a description of how to fix env settings like CPATH, C_INCLUDE_PATH and CPLUS_INCLUDE_PATH so the CUDA include can be found.
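That kind of environment fix might be sketched like this (the header location is an assumption about where the container ships its CUDA headers):

```shell
# Point the C/C++ include search paths at the container's CUDA headers
# (assumed to live under /usr/local/cuda/include).
export CUDA_HOME=/usr/local/cuda
export CPATH="$CUDA_HOME/include${CPATH:+:$CPATH}"
export C_INCLUDE_PATH="$CUDA_HOME/include${C_INCLUDE_PATH:+:$C_INCLUDE_PATH}"
export CPLUS_INCLUDE_PATH="$CUDA_HOME/include${CPLUS_INCLUDE_PATH:+:$CPLUS_INCLUDE_PATH}"
echo "$CPATH"
```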

Which makes me realise: I was assuming that the tritonserver vLLM docker container itself contains CUDA. I don’t have nvidia-cuda or nvidia-cuda-dev installed on the host OS. In the Jetson AGX Thor Developer Kit User Guide, containers are described as an alternative to installing CUDA on the host OS. So now I’m wondering: does nvcr.io/nvidia/tritonserver:25.09-vllm-python-py3 include CUDA?

What is the docker run command and vllm command that caused the error?


I wasn’t sure if you were in the docker container when you encountered that error. It does not have cuda_utils.c, but it does have

CUDA 13.0, and

/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/jit/cuda/cuda.h
/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include/torch/cuda.h