I followed your guide and pulled the Docker images.
I started the container with:
sudo docker run --ipc=host --net host --gpus all --runtime=nvidia --privileged \
  -it -u 0:0 -v ~/my_models:/models --name=thor_vllm thor_vllm_container:25.08-py3-base
docker exec -it thor_vllm /bin/bash
cd script
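Inside the container I also did a quick sanity check that the GPU is visible (optional; this assumes PyTorch is bundled in the container image, which it should be for a vLLM build):

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"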
I downloaded the model from the Qwen3-30B-A3B-quantized.w4a16 · Model Library page.
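For anyone reproducing this, the download can be done along these lines, into the host directory that the container mounts as /models (a sketch using the ModelScope CLI; the repo id below is my guess based on the model name, so replace it with the id shown on the page you actually download from):

pip install modelscope
modelscope download \
  --model RedHatAI/Qwen3-30B-A3B-quantized.w4a16 \
  --local_dir ~/my_models/Qwen3-30B-A3B-quantized.w4a16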
Then I modified run_vllm_llm_serve.sh as follows:
#!/bin/bash
# Simple script to serve a model with vLLM
# Usage: ./run_vllm_llm_serve.sh <model_name>
if [ -z "$1" ]; then
echo "Usage: $0 <model_name>"
exit 1
fi
MODEL_NAME="$1"
shift
# Run vllm serve with the given model name and any extra options
echo 3 | tee /proc/sys/vm/drop_caches
export VLLM_DISABLED_KERNELS=MacheteLinearKernel
# Set quantization flag based on the model name
if [ "$MODEL_NAME" = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" ]; then
QUANTIZATION="gptq"
else
QUANTIZATION="compressed-tensors"
fi
sync && echo 3 | tee /proc/sys/vm/drop_caches && \
  VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve /models/Qwen3-30B-A3B-quantized.w4a16 \
    --swap-space 16 \
    --max-seq-len 32768 \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --max-num-seqs 1024 \
    --dtype auto \
    --gpu-memory-utilization 0.80 \
    --served-model-name qwen30b
Then I ran:
./run_vllm_llm_serve.sh
Then I sent a curl request from another device and monitored the vLLM log on the Thor:
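The request looked roughly like this (a sketch; it assumes the default port 8000 and uses the served model name qwen30b from the script above, with the Thor's IP in place of <thor-ip>):

curl http://<thor-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen30b",
        "messages": [{"role": "user", "content": "Write a short story about a robot."}],
        "max_tokens": 512
      }'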
However, I got the same result: generation speed was still around 53 tokens/s.
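In case the measurement method matters, the figure from the log can also be cross-checked from the client side by timing one request and dividing completion tokens by wall-clock time (a sketch; it assumes jq and bc are installed on the client and that the server returns the usage field; it includes prefill time, so it slightly understates pure decode speed):

START=$(date +%s.%N)
RESP=$(curl -s http://<thor-ip>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen30b", "prompt": "Write a short story about a robot.", "max_tokens": 512}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "approx tokens/s:"
echo "scale=1; $TOKENS / ($END - $START)" | bc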

