Hi.
I am trying to benchmark Qwen3-VL-8B-Instruct-AWQ-4bit on my Jetson AGX Orin 64GB.
I pulled the image ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin, then downloaded and served the model in Terminal 1:
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve /model \
--port "8000" \
--host "0.0.0.0" \
--trust_remote_code \
--swap-space 1 \
--max-model-len 32000 \
--max-num-seqs 64 \
--gpu-memory-utilization 0.9
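For reference, a quick way to confirm the server answers chat requests before benchmarking (a minimal sketch; it assumes the endpoint is reachable on localhost:8000 and the served model name is /model):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'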
Tried to benchmark in Terminal 2 following the tutorial GenAI Benchmarking: LLMs and VLMs on Jetson | Jetson AI Lab, but it reported: Multi-modal content is only supported on 'openai-chat' and 'openai-audio' backends.
However, even when I change the command to:
vllm bench serve \
--backend openai-chat \
--base-url http://localhost:8000/v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--hf-split train \
--model /model \
--num-prompts 50 \
--percentile-metrics ttft,tpot,itl,e2el \
--hf-output-len 128 \
--max-concurrency 1
It keeps reporting ValueError: OpenAI Chat Completions API URL must end with one of: {'chat/completions', 'profile'}. Additionally, in Terminal 1 I see GET /v1/chat/completions/metrics HTTP/1.1" 404 Not Found.
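My guess is that the benchmark client appends its own --endpoint path to --base-url, so the base URL should probably stop at the host and the chat path should go into --endpoint instead. A sketch of what I think is intended (untested on my side, and it assumes vllm bench serve accepts an --endpoint flag):
vllm bench serve \
--backend openai-chat \
--base-url http://localhost:8000 \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/vision-arena-bench-v0.1 \
--hf-split train \
--model /model \
--num-prompts 50 \
--percentile-metrics ttft,tpot,itl,e2el \
--hf-output-len 128 \
--max-concurrency 1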
I wonder how to benchmark a VLM on the Orin the way the tutorial does.
Thanks.
Hi,
Could you check the instructions in the link below first?
Thanks.
Okay. I noticed that the docker image used in your link is dustynv/vllm:r36.4-cu129-24.04.
So what should I do about the official Package vllm · GitHub? Should I just discard it and pull the dustynv one? My Orin is on JetPack 6.2.1 as well.
And another question please:
I have set MAXN mode and run sudo jetson_clocks, but it keeps warning "System throttled due to Over-current" every time while benchmarking.
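For reference, the commands I used to set the power mode and lock the clocks (assuming MAXN is power mode 0 on the AGX Orin 64GB):
sudo nvpmodel -m 0
sudo jetson_clocks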
Thanks.
Hi,
You can try the one from GitHub.
The vLLM version is different and the GitHub one should be newer.
Throttling is expected when running GPU-heavy loads in performance mode.
The warning can be turned off. Please check the comment below:
Hi,
We tested a custom nvpmodel with 2x CPU@1728, GPU@1020, and EMC@3199.
Under heavy GPU load (e.g., LLM inference), we can still observe the OC throttling behavior.
But similar to the previous testing, the impact on performance is limited, as the clocks don't decrease much (1019 -> 1003).
01-22-2025 06:53:47 RAM 7119/7620MB (lfb 4x4MB) SWAP 580/16384MB (cached 155MB) CPU [0%@1728,2%@1728,off,off,off,off] EMC_FREQ 43%@3199 GR3D_FREQ 99%@[1008] NVDEC off NVJPG off NVJPG1 off VIC off OF…
Thanks.
Do you mean throttling has no impact after sudo jetson_clocks and that I can just turn the warning off? In my tries, the best performance was achieved under the original MAXN mode with throttling, better than constrained CPU clocks without throttling.
In my case, OC3 was triggered and it still reported over-current even after I constrained the max CPU frequency to the same value as the 30W mode and left everything else as in MAXN.
Thanks.
Hi,
Throttling will decrease the clocks, but you can leave it running in the background and turn off the warning message.
Thanks.