When I use Docker to deploy the Qwen 2.5 VL 3B model following the instructions on a 64GB Jetson AGX Orin,
I only get about 30 tokens/s when testing with vlm-bench.py.
My configuration is:
vLLM with quantization=w4a16, max concurrency=8, input seq len=2048, and output seq len=128.
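In case it helps reproduce my numbers, here is a minimal offline-benchmark sketch against the plain vLLM Python API that approximates those settings. It's only an approximation: I'm assuming w4a16 corresponds to a 4-bit-weight / 16-bit-activation checkpoint (e.g. an AWQ-style export, which the `model=` argument below stands in for), and that "max concurrency = 8" maps to vLLM's `max_num_seqs`:

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical model name; the real run presumably points at a local
# w4a16 (4-bit weight, 16-bit activation) quantized export.
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    max_model_len=2048 + 128,   # input seq len + output seq len from the config above
    max_num_seqs=8,             # assuming "max concurrency = 8" maps to this knob
    gpu_memory_utilization=0.9,
)

# Force fixed-length 128-token outputs so throughput numbers are comparable.
params = SamplingParams(max_tokens=128, ignore_eos=True)

# The benchmark config uses ~2048-token inputs; a short prompt is used here
# only to keep the sketch self-contained.
prompts = ["Describe the scene in detail."] * 8  # 8 concurrent requests

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```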
But the officially reported benchmark is 216 tokens/s, which is far higher than my result.
I want to know whether the official model is optimized with TensorRT-LLM or only deployed through the vLLM engine. Why are the results so different, and how should I optimize my setup?
My PyTorch version is: 2.3.0 + CUDA 12.4
JetPack version is: 6.2.1+b38
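For completeness, this is roughly how I checked those versions (a quick sanity-check sketch; I'm assuming a standard JetPack install where /etc/nv_tegra_release is the L4T release file):

```python
import torch

print("PyTorch:", torch.__version__)              # expect 2.3.0
print("CUDA (torch build):", torch.version.cuda)  # expect 12.4
print("CUDA available:", torch.cuda.is_available())

# L4T / JetPack release string (standard location on Jetson devices)
with open("/etc/nv_tegra_release") as f:
    print(f.read().strip())
```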