Performance Comparison of Qwen3-30B-A3B-AWQ on Jetson Thor vs Orin AGX 64GB

I conducted a performance comparison of the Qwen3-30B-A3B-AWQ model on two NVIDIA Jetson devices. The model was downloaded from Qwen3-30B-A3B-AWQ · Model Library.

Jetson Thor Test (192.168.1.168)

  • Model used: Qwen3-30B-A3B-AWQ

  • Container: nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3

  • vLLM command:

python3 -m vllm.entrypoints.openai.api_server \
    --model ./Qwen3-30B-A3B-AWQ/ \
    --dtype auto \
    --tensor-parallel-size 1 \
    --max-model-len 20480 \
    --gpu-memory-utilization 0.8 \
    --served-model-name qwen30b
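
Before testing, I confirmed the server was reachable with a quick check of the models endpoint (a minimal sketch, assuming the default port 8000; it should list qwen30b):

# Quick health check: list the served models
curl -s http://192.168.1.168:8000/v1/models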

Jetson Orin AGX Test (192.168.1.39)

  • Model used: same as above

  • vLLM installed via the jp6/cu126 index, version 0.8.5 (I downloaded this wheel two months ago; the current version is 0.10.2). See the install sketch after the command below.

  • vLLM command:

python3 -m vllm.entrypoints.openai.api_server \
    --model /data/qwen3-30B \
    --dtype auto \
    --tensor-parallel-size 1 \
    --max-model-len 20480 \
    --gpu-memory-utilization 0.9 \
    --served-model-name Qwen3-30B-A3B-AWQ
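
For reference, the install itself was a single pip command against the Jetson wheel index (a sketch; the index URL below is my assumption of the Jetson AI Lab jp6/cu126 mirror, so substitute whichever index was actually used):

# Install the JetPack 6 / CUDA 12.6 vLLM wheel on the Orin
# (index URL assumed; version pinned to what was available at the time)
pip3 install vllm==0.8.5 --extra-index-url https://pypi.jetson-ai-lab.dev/jp6/cu126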

Testing via curl

Thor device:

curl -X POST http://192.168.1.168:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen30b",
    "messages": [
      {
        "role": "user",
        "content": "Please provide a detailed analysis of the potential impacts of climate change on global agriculture over the next 50 years, considering factors such as changing weather patterns, water availability, soil quality, crop yields, pest populations, and economic consequences for farmers in different regions. Include possible mitigation strategies, technological innovations, and policy recommendations."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.95
  }'

Orin AGX 64GB:

curl -X POST http://192.168.1.39:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-30B-A3B-AWQ",
    "messages": [
      {
        "role": "user",
        "content": "Please provide a detailed analysis of the potential impacts of climate change on global agriculture over the next 50 years, considering factors such as changing weather patterns, water availability, soil quality, crop yields, pest populations, and economic consequences for farmers in different regions. Include possible mitigation strategies, technological innovations, and policy recommendations."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_p": 0.95
  }'
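
To cross-check the numbers from the vLLM log, output tokens/s can also be derived client-side from the response's usage field. Below is a minimal sketch (my own, not part of the original test): it assumes jq is installed, targets the Thor endpoint, and includes prompt-processing time, so it slightly understates pure decode speed.

#!/bin/bash
# Time one non-streaming completion and compute output tokens/s
HOST=http://192.168.1.168:8000   # swap in 192.168.1.39 (and the other model name) for the Orin
START=$(date +%s.%N)
RESP=$(curl -s -X POST "$HOST/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen30b",
       "messages": [{"role": "user", "content": "Explain KV caching in vLLM."}],
       "max_tokens": 1024, "temperature": 0.7, "top_p": 0.95}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
awk -v t="$TOKENS" -v s="$START" -v e="$END" \
  'BEGIN { printf "%d output tokens, %.1f tokens/s\n", t, t / (e - s) }'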

Throughput observed in the vLLM service logs:

  1. Jetson Thor: ~53.0 tokens/s
  2. Orin AGX 64GB: ~41.5 tokens/s

Question:

The Thor is rated at 2000 TOPS (FP4), while the Orin AGX 64GB is rated at 275 TOPS (INT8). Why is the performance difference so small? Could there be something wrong with my setup or testing method?
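
For what it's worth, here is a rough back-of-envelope I tried (entirely my own assumptions: single-stream decode is memory-bandwidth-bound, roughly 3B active parameters for A3B at about 0.5 bytes per parameter after 4-bit quantization, and spec bandwidths of 273 GB/s for Thor vs 204.8 GB/s for the Orin AGX 64GB):

# Crude bandwidth-bound decode ceilings; every number here is an assumption
awk 'BEGIN {
  bytes_per_token = 3e9 * 0.5;   # active weights read once per decoded token
  printf "Thor ceiling: %.0f tok/s\n", 273e9 / bytes_per_token;
  printf "Orin ceiling: %.0f tok/s\n", 204.8e9 / bytes_per_token;
  printf "bandwidth ratio: %.2f\n", 273 / 204.8;
}'

The measured ratio (53 / 41.5 ≈ 1.28) lands close to the bandwidth ratio (≈ 1.33), which makes me wonder whether a single request simply never exercises the compute advantage.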

Hi,

You can find our benchmark data in the link below:

For Qwen3-30B-A3B, we got 226.42 output tokens/sec for Thor and 76.69 for Orin.
The steps to reproduce the Thor result can be found in the link below:

Thanks.

Thanks a lot. I will try it soon!

I followed your guide and pulled the Docker image:

I started it with the following commands:

sudo docker run --ipc=host --net host --gpus all --runtime=nvidia --privileged \
  -it -u 0:0 -v ~/my_models:/models --name=thor_vllm thor_vllm_container:25.08-py3-base
docker exec -it thor_vllm /bin/bash
cd script

I've downloaded the model from Qwen3-30B-A3B-quantized.w4a16 · Model Library

Then I modified run_vllm_llm_serve.sh as follows:

#!/bin/bash

# Simple script to serve a model with vLLM
# Usage: ./run_vllm_llm_serve.sh [model_name]

# The serve command below hardcodes the model path, so fall back to it
# when no argument is given (the original script required one)
MODEL_NAME="${1:-/models/Qwen3-30B-A3B-quantized.w4a16}"

# Drop the page cache so the model weights load from a cold state
echo 3 | tee /proc/sys/vm/drop_caches
export VLLM_DISABLED_KERNELS=MacheteLinearKernel

# Set the quantization flag based on the model name
# (note: the serve command below does not actually pass $QUANTIZATION)
if [ "$MODEL_NAME" = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" ]; then
  QUANTIZATION="gptq"
else
  QUANTIZATION="compressed-tensors"
fi

sync && echo 3 | tee /proc/sys/vm/drop_caches && \
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /models/Qwen3-30B-A3B-quantized.w4a16 \
  --swap-space 16 \
  --max-seq-len 32768 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1024 \
  --dtype auto \
  --gpu-memory-utilization 0.80 \
  --served-model-name qwen30b

Then I used the command:

./run_vllm_llm_serve.sh

Then I used another device to make a curl request and monitored the Thor device's vLLM log.

However, I got the same result: it was still around 53 tokens/s.
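
One thing I still want to rule out (my own guess, not something stated on the benchmark page): the 226 tokens/s figure may be aggregate throughput across many concurrent requests, while a single curl only measures one decode stream. A quick test would be firing several requests in parallel (a sketch; payload.json is a hypothetical file holding the request body from above):

# Launch 8 identical requests so vLLM can batch them, then watch the
# "Avg generation throughput" line in the server log
for i in $(seq 8); do
  curl -s -X POST http://192.168.1.168:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @payload.json > /dev/null &
done
wait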

When will TensorRT-LLM support Jetson Thor?