Performance Comparison of Qwen3-30B-A3B-AWQ on Jetson Thor vs Orin AGX 64GB

ckdavid233 · September 19, 2025, 6:10am

I conducted a performance comparison of the Qwen3-30B-A3B-AWQ model on two NVIDIA Jetson devices.The model is downloaded from Qwen3-30B-A3B-AWQ · 模型库

Jetson Thor Test (192.168.1.168)

Model used: Qwen3-30B-A3B-AWQ
Container: nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3
vLLM command:

python3 -m vllm.entrypoints.openai.api_server
–model ./Qwen3-30B-A3B-AWQ/
–dtype auto
–tensor-parallel-size 1
–max-model-len 20480
–gpu-memory-utilization 0.8
–served-model-name qwen30b

Jetson Orin AGX Test (192.168.1.39)

Model used: same as above
vLLM installed via jp6/cu126 index, version 0.8.5(I downloaded this whl file 2 months ago, now is 0.10.2)
vLLM command:

python3 -m vllm.entrypoints.openai.api_server
–model /data/qwen3-30B
–dtype auto
–tensor-parallel-size 1
–max-model-len 20480
–gpu-memory-utilization 0.9
–served-model-name Qwen3-30B-A3B-AWQ

Testing via curl

Thor device:

curl -X POST http://192.168.1.168:8000/v1/chat/completions
-H “Content-Type: application/json”
-d ‘{
“model”: “qwen30b”,
“messages”: [
{
“role”: “user”,
“content”: “Please provide a detailed analysis of the potential impacts of climate change on global agriculture over the next 50 years, considering factors such as changing weather patterns, water availability, soil quality, crop yields, pest populations, and economic consequences for farmers in different regions. Include possible mitigation strategies, technological innovations, and policy recommendations.”
}
],
“max_tokens”: 1024,
“temperature”: 0.7,
“top_p”: 0.95
}’

Orin AGX 64GB:

curl -X POST http://192.168.1.39:8000/v1/chat/completions
-H “Content-Type: application/json”
-d ‘{
“model”: “Qwen3-30B-A3B-AWQ”,
“messages”: [
{
“role”: “user”,
“content”: “Please provide a detailed analysis of the potential impacts of climate change on global agriculture over the next 50 years, considering factors such as changing weather patterns, water availability, soil quality, crop yields, pest populations, and economic consequences for farmers in different regions. Include possible mitigation strategies, technological innovations, and policy recommendations.”
}
],
“max_tokens”: 1024,
“temperature”: 0.7,
“top_p”: 0.95
}’

Monitoring vLLM service results:

Jetson Thor: ~53.0 tokens/s
Orin AGX 64GB: ~41.5 tokens/s

Question:

The Thor has 2000 TOPS FP4, while the Orin AGX 64GB has 275 TOPS INT8. Why is the performance difference so small? Could there be something wrong in my setup or testing method?

AastaLLL · September 19, 2025, 7:12am

Hi,

You can find our benchmark data in the link below:

For Qwen3-30B-A3B, we got 226.42 output tokens/sec for Thor and 76.69 for Orin.
The steps to reproduce the Thor can be found in the link below:

Thanks.

ckdavid233 · September 19, 2025, 7:17am

thx a lot. I will try it soon!!

ckdavid233 · September 19, 2025, 9:39am

I followed your guide and pull the docker iamges:

I started it with the command of:

sudo docker run --ipc=host --net host --gpus all --runtime=nvidia --privileged
-it -u 0:0 -v ~/my_models:/models --name=thor_vllm thor_vllm_container:25.08-py3-base

docker exec -it thor_vllm /bin/bash
cd script

I’ve downloaded the model from Qwen3-30B-A3B-quantized.w4a16 · 模型库

then I modified the run_vllm_llm_serve.sh as follow:

#!/bin/bash

# Simple script to serve a model with vLLM
# Usage: ./run_vllm_llm_serve.sh <model_name>

if [ -z "$1" ]; then
  echo "Usage: $0 <model_name>"
  exit 1
fi

MODEL_NAME="$1"
shift

# Run vllm serve with the given model name and any extra options
echo 3 | tee /proc/sys/vm/drop_caches
export VLLM_DISABLED_KERNELS=MacheteLinearKernel

# Set quantization flag based on the model name
if [ "$MODEL_NAME" = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" ]; then
  QUANTIZATION="gptq"
else
  QUANTIZATION="compressed-tensors"
fi

sync && echo 3 | tee /proc/sys/vm/drop_caches && VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /models/Qwen3-30B-A3B-quantized.w4a16 --swap-space 16  --max-seq-len 32768 --max-model-len 32768 --tensor-parallel-size 1 --max-num-seqs 1024 --dtype auto --gpu-memory-utilization 0.80 --served-model-name qwen30b

then I used the command:

./run_vllm_llm_serve.sh

Then I used another device to make a curl request and monitored the Thor device’s vllm log:

However I got the same result?

It was still around 53 tokens/s?

user66610 · September 19, 2025, 12:31pm

When will TensorRT-LLM support Jetson Thor?

changtimwu · September 22, 2025, 2:43am

Hi! I’m still waiting for a working VLLM port for THOR. I’m surprised a VLLM container demo appeared in the guide. I wonder if the examples in that guide work on Jetson Linux r38.2, since all L4T/Jetson Linux references there are r38.1.

AastaLLL · September 22, 2025, 5:06am

Hi, all

Sorry about the confusion.

The nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3 is the latest container we used for the Table.
We have updated the steps for JetPack 7.0 to the link below:

With the steps, we can get 354.74 output tokens/sec on Thor with Qwen2.5-VL-3B model.
Please update the command to use Qwen3-30B-A3B for your use case.

Thanks.

ckdavid233 · September 22, 2025, 5:08am

Thank you!I’d get it a try soon.

yuchaoz · September 25, 2025, 6:33am

why no access to this?

yuchaoz · September 25, 2025, 8:52am

Hi Sam,

Thanks for reply. Issue solved

Topic		Replies	Views
The token speed of qwen 2.5 vl 3b model is very lower on Jeston AGX Orin Jetson AGX Orin generative_ai	2	78	September 22, 2025
Announcing new VLLM container & 3.5X increase in Gen AI Performance in just 5 weeks of Jetson AGX Thor Launch Jetson Thor jetson , llama-31-8b-instruct , llama , nemotron	20	562	October 15, 2025
LLMs token/sec Jetson AGX Orin generative_ai	2	1110	April 8, 2024
LLaMa 2 LLMs w/ NVIDIA Jetson and textgeneration-web-ui Jetson Projects generative_ai	86	25080	May 10, 2024
Running LLMs with TensorRT-LLM on Nvidia Jetson AGX Orin Dev Kit Jetson Projects jetson , generative_ai	1	653	December 8, 2024
Running llama3.3 or llama4 on Jetson AGX Orin Developer Kit (64 GB) Jetson AGX Orin generative_ai	8	630	May 12, 2025
Can someone tell me how to benchmark LLama_v2_7b model on jetson Orin AGX with different quantization methods? NVIDIA AI Workbench jetson , generative_ai	2	77	April 3, 2025
Geforce RTX 3090 versus Jetson AGX Xavier for inference in AI TensorRT jetson-inference	3	1112	October 12, 2021
Can't run nvcr.io/nvidia/l4t-tensorrt:r8.2.1-runtime on Orin AGX Jetson AGX Orin tensorrt	19	1236	May 13, 2022
The token speed of LLM on Jetson AGX Orin Jetson AGX Orin generative_ai , llm , llama	4	127	September 25, 2025

Performance Comparison of Qwen3-30B-A3B-AWQ on Jetson Thor vs Orin AGX 64GB

Related topics