Token throughput of the Qwen 2.5 VL 3B model is much lower than expected on Jetson AGX Orin

When I use Docker to deploy the Qwen 2.5 VL 3B model following the instructions on a 64GB Jetson AGX Orin:

I only get about 30 tokens/s when testing with vlm-bench.py.

The configuration is:
vLLM with quantization=w4a16, max concurrency = 8, input seq len = 2048, and output seq len = 128
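
For reference, here is a minimal sketch of what I believe this setup corresponds to in the standard vLLM Python API. The model name (Qwen/Qwen2.5-VL-3B-Instruct-AWQ, an AWQ checkpoint matching the w4a16 setting) and the placeholder prompts are my assumptions, not the exact command from the container:

```python
import time
from vllm import LLM, SamplingParams

# Assumed AWQ checkpoint: 4-bit weights / 16-bit activations (w4a16).
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
    quantization="awq",
    max_num_seqs=8,            # max concurrency = 8
    max_model_len=2048 + 128,  # input seq len 2048 + output seq len 128
)

# Force exactly 128 output tokens per request so throughput is comparable.
params = SamplingParams(max_tokens=128, ignore_eos=True)
prompts = ["..."] * 8  # placeholder: 8 concurrent ~2048-token prompts

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Output token throughput = total generated tokens / wall-clock time.
total_out = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"output token throughput: {total_out / elapsed:.1f} tokens/s")
```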

However, the officially reported benchmark is 216 tokens/s, which is far higher than my result:

I want to know whether the official model is optimized with TensorRT-LLM or deployed only with the vLLM engine. Why is the result so different, and how should I optimize my setup?

My PyTorch version is: 2.3.0 + CUDA 12.4
JetPack version is: 6.2.1+b38

Hi,
We are checking with our team and will share information about how we did the benchmark.

Hi,

Thanks for your patience.

Please find the steps at the link below to reproduce the benchmark results.
We get 225.65 tokens/s output token throughput on an AGX Orin 64GB Developer Kit.

Thanks.