Performance Inquiry: Optimizing Qwen3-VL 2B Inference for 2 QPS Target on Orin Nano Super

We are targeting 2 QPS (queries per second) with Qwen3-VL-2B-Instruct (Qwen/Qwen3-VL-2B-Instruct on Hugging Face) on Jetson Orin Nano Super (MAXN_SUPER mode).

Target Workload Specifications

  • Input: 100-200 text tokens + single image (1280×720 resolution)
  • Inference Mode: Single-token generation (prefill-only, no decode phase required)
  • Performance Goal: 2 QPS sustained throughput
  • Precision: FP16/BF16 (quantization excluded due to accuracy requirements)
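For context, a rough prefill-token budget can be derived from the specs above. The vision-token count below is an assumption on our part (one token per 32×32 pixel block, i.e. 16 px ViT patches with 2×2 spatial merge, as in Qwen2.5-VL); the actual Qwen3-VL tokenization may differ:

```python
# Back-of-envelope prefill budget at 2 QPS.
# ASSUMPTION: one vision token per 32x32 pixel block (16 px patches
# with 2x2 merge, as in Qwen2.5-VL); not verified for Qwen3-VL.

def vision_tokens(width: int, height: int, block: int = 32) -> int:
    # Ceil-divide each dimension into merged-patch blocks.
    return -(-width // block) * -(-height // block)

def prefill_budget_ms(qps: float = 2.0) -> float:
    # Per-query latency budget for a sustained QPS target.
    return 1000.0 / qps

img_tokens = vision_tokens(1280, 720)  # 40 * 23 blocks
total_tokens = img_tokens + 200        # plus the max text-token count
# At 2 QPS, ~1120 prefill tokens per query implies a sustained
# prefill rate of roughly 2 * 1120 = 2240 tokens/s.
print(img_tokens, total_tokens, prefill_budget_ms())  # 920 1120 500.0
```

Under these assumptions, each 720p query costs on the order of 1100 prefill tokens, so the 2 QPS target implies roughly 2.2k prefill tokens/s sustained.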

Current Benchmark Results

Our evaluation with existing frameworks shows a significant gap from the target:

Framework               Throughput   Gap vs. Target   Notes
transformers (4.57.1)   0.89 QPS     -55%             Baseline implementation
llama.cpp (b7641)       0.53 QPS     -73%             Counter-intuitively slower than transformers
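For reproducibility, the QPS numbers above come from a simple timing loop. A framework-agnostic sketch of such a harness (the `measure_qps` helper is illustrative, not our exact script) looks like this:

```python
import time

def measure_qps(run_once, warmup: int = 3, iters: int = 10) -> float:
    """Time a single-query inference callable and report sustained QPS.

    run_once: zero-argument callable that executes one full query
    (prefill + single-token generation) and blocks until it finishes.
    """
    # Warm-up iterations exclude one-time costs (allocator growth,
    # kernel autotuning, graph capture) from the measurement.
    for _ in range(warmup):
        run_once()
    t0 = time.perf_counter()
    for _ in range(iters):
        run_once()
    return iters / (time.perf_counter() - t0)

# Usage: wrap one prefill-only call, e.g. for transformers:
#   qps = measure_qps(lambda: model.generate(**inputs, max_new_tokens=1))
```

Note that on CUDA backends the callable must synchronize (e.g. via a blocking `generate()` or an explicit `torch.cuda.synchronize()`), or the loop will only measure kernel-launch time.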

Identified Bottlenecks

(a) Runtime Memory Footprint
We evaluated SGLang deployment, but the runtime overhead consumes 1-2 GB of device memory. For Orin Nano’s constrained 8 GB shared memory environment, this footprint is prohibitively large for production deployment.

(b) Kernel Efficiency Issues
The SGLang community previously reported severe performance degradation in Qwen3-VL inference due to specific cuDNN version incompatibilities (see reference-link).

Notably, llama.cpp demonstrates inferior performance compared to native transformers (0.53 vs. 0.89 QPS), which is counter-intuitive given llama.cpp’s typical optimizations. We suspect llama.cpp may be encountering similar kernel-level inefficiencies or unoptimized attention implementations for Qwen3-VL’s visual architecture that remain unaddressed on Jetson.

Request for Guidance

Could you recommend a feasible deployment strategy to achieve the 2 QPS target for this vision-language workload on Orin Nano Super? Specifically:

  1. Are there Jetson-optimized inference engines (TensorRT-LLM, vLLM Jetson builds, or custom CUDA graphs) validated for Qwen3-VL 2B?

  2. What is the recommended approach to minimize runtime memory overhead while maintaining throughput for multimodal models?

  3. Are there specific cuDNN/cuBLAS versions or PyTorch builds known to resolve the kernel efficiency issues for Qwen2.5-VL/Qwen3-VL architectures on SM87?

Hi,

Please try the vllm container below:

Thanks.

@AastaLLL
Thanks, I have tried vLLM, but it has a bug that is not completely fixed yet (https://github.com/vllm-project/vllm/issues/27992?sharetype=link), which causes the Nano to OOM at launch time.
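In case a fixed build becomes available, vLLM's standard memory-reduction flags may be worth pre-testing. This is an untested sketch: the flag names come from vLLM's serve CLI, but the values are guesses for an 8 GB shared-memory device and are not validated on Jetson:

```shell
# Untested sketch: standard vLLM serve flags for shrinking launch-time
# memory (values are guesses for 8 GB shared memory, not validated).
# --max-model-len caps the KV-cache allocation for this workload,
# --gpu-memory-utilization leaves headroom for the OS on shared memory,
# --enforce-eager skips CUDA-graph capture and its extra memory.
vllm serve Qwen/Qwen3-VL-2B-Instruct \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.6 \
  --enforce-eager
```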

Can you recommend other frameworks?