Performance Comparison of Qwen3-30B-A3B-AWQ on Jetson Thor vs Orin AGX 64GB

I followed your guide and pulled the Docker image.

I started it with the following commands:

sudo docker run --ipc=host --net host --gpus all --runtime=nvidia --privileged \
  -it -u 0:0 -v ~/my_models:/models --name=thor_vllm thor_vllm_container:25.08-py3-base
docker exec -it thor_vllm /bin/bash
cd script
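As a sanity check before serving, GPU visibility inside the container can be confirmed with something like the following (a sketch assuming PyTorch is available in the image):

# Inside the container: confirm CUDA is visible before starting vLLM
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"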

I downloaded the model from Qwen3-30B-A3B-quantized.w4a16 · Model Library and placed it under ~/my_models.
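For anyone reproducing this, the checkpoint can be fetched into ~/my_models roughly like this (a sketch assuming the huggingface_hub CLI and the repo id RedHatAI/Qwen3-30B-A3B-quantized.w4a16; adjust if you pulled it from ModelScope instead):

pip install -U "huggingface_hub[cli]"
huggingface-cli download RedHatAI/Qwen3-30B-A3B-quantized.w4a16 \
  --local-dir ~/my_models/Qwen3-30B-A3B-quantized.w4a16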

Then I modified run_vllm_llm_serve.sh as follows:

#!/bin/bash

# Simple script to serve a model with vLLM
# Usage: ./run_vllm_llm_serve.sh [model_path]

# Model path is optional; defaults to the downloaded Qwen3 checkpoint
MODEL_NAME="${1:-/models/Qwen3-30B-A3B-quantized.w4a16}"

# Drop filesystem caches and disable the Machete linear kernel before serving
echo 3 | tee /proc/sys/vm/drop_caches
export VLLM_DISABLED_KERNELS=MacheteLinearKernel

# Set quantization flag based on the model name
if [ "$MODEL_NAME" = "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" ]; then
  QUANTIZATION="gptq"
else
  QUANTIZATION="compressed-tensors"
fi

sync && echo 3 | tee /proc/sys/vm/drop_caches

VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve "$MODEL_NAME" \
  --quantization "$QUANTIZATION" \
  --swap-space 16 \
  --max-seq-len 32768 \
  --max-model-len 32768 \
  --tensor-parallel-size 1 \
  --max-num-seqs 1024 \
  --dtype auto \
  --gpu-memory-utilization 0.80 \
  --served-model-name qwen30b

Then I launched the server with:

./run_vllm_llm_serve.sh

Then I sent a curl request from another device (along the lines of the sketch below) and monitored the vLLM log on the Thor device.
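A minimal request sketch against vLLM's OpenAI-compatible endpoint (here <THOR_IP> and the prompt are placeholders; the model name matches --served-model-name above):

curl http://<THOR_IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen30b", "prompt": "Explain the Jetson Thor in one paragraph.", "max_tokens": 256}'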

However, I got the same result: still around 53 tokens/s. Why hasn't the throughput changed?
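To double-check that number independently of the vLLM log, a rough client-side estimate can be made like this (same placeholders as above; it divides the server-reported completion_tokens by wall-clock time, so it includes prefill and network overhead and will read slightly below the log's generation throughput):

# Rough client-side throughput estimate: completion tokens / wall-clock seconds
START=$(date +%s.%N)
curl -s http://<THOR_IP>:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen30b", "prompt": "Write a short story about a robot.", "max_tokens": 512}' \
  > /tmp/resp.json
END=$(date +%s.%N)
python3 - "$START" "$END" <<'EOF'
import json, sys
start, end = float(sys.argv[1]), float(sys.argv[2])
tokens = json.load(open("/tmp/resp.json"))["usage"]["completion_tokens"]
print(f"{tokens} tokens in {end - start:.2f} s -> {tokens / (end - start):.1f} tokens/s")
EOF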