We are targeting 2 QPS (queries per second) with Qwen3-VL-2B-Instruct (Qwen/Qwen3-VL-2B-Instruct on Hugging Face) on a Jetson Orin Nano Super running in MAXN_SUPER mode.
Target Workload Specifications
- Input: 100-200 text tokens + single image (1280×720 resolution)
- Inference Mode: Single-token generation (prefill-only, no decode phase required)
- Performance Goal: 2 QPS sustained throughput
- Precision: FP16/BF16 (quantization excluded due to accuracy requirements)
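Since our inference mode is prefill-only, each request reduces to a greedy `generate` call capped at one new token. A minimal sketch of the request parameters this corresponds to (the helper name `build_prefill_request` is ours, purely illustrative):

```python
def build_prefill_request(inputs: dict) -> dict:
    """Return generate() kwargs for prefill-only, single-token output.

    `inputs` is the dict produced by the model's processor (input_ids,
    pixel_values, attention_mask, ...). With max_new_tokens=1 the decode
    phase collapses to a single forward step after prefill.
    """
    return {
        **inputs,
        "max_new_tokens": 1,  # single-token generation: no autoregressive decode loop
        "do_sample": False,   # greedy decoding; no sampling variance between runs
    }
```

The resulting dict is passed straight to `model.generate(**kwargs)`.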
Current Benchmark Results
Our evaluation with existing frameworks shows a significant gap from the target:
| Framework | Throughput | Gap vs. Target | Notes |
|---|---|---|---|
| transformers (4.57.1) | 0.89 QPS | -55% | Baseline implementation |
| llama.cpp (b7641) | 0.53 QPS | -73% | Counter-intuitively slower than transformers |
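For reproducibility, the throughput numbers above can be collected with a simple wall-clock harness like the sketch below (the function name is ours, not from any framework); `run_request` would wrap one full processor + `generate(max_new_tokens=1)` call:

```python
import time
from typing import Callable

def measure_qps(run_request: Callable[[], None],
                warmup: int = 3, iters: int = 20) -> float:
    """Sequential QPS: run `run_request` `iters` times, divide by wall time."""
    for _ in range(warmup):          # warm up CUDA kernels / allocator caches
        run_request()
    start = time.perf_counter()
    for _ in range(iters):
        run_request()
    elapsed = time.perf_counter() - start
    return iters / elapsed
```

At the 2 QPS target this corresponds to a per-request budget of 500 ms end to end.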
Identified Bottlenecks
(a) Runtime Memory Footprint
We evaluated SGLang deployment, but its runtime overhead alone consumes 1-2 GB of device memory. In the Orin Nano's constrained 8 GB shared-memory environment, that footprint is prohibitively large for production deployment.
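Because the Orin Nano's 8 GB is unified (CPU and GPU share it), the footprint that matters is system-wide `MemAvailable` rather than per-process CUDA counters. A small sketch we use to track headroom before and after loading a runtime (parsing `/proc/meminfo`; the helper names are ours):

```python
def available_mib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) from /proc/meminfo text, return MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # field format: 'MemAvailable:  123456 kB'
            return kib / 1024.0
    raise ValueError("MemAvailable not found")

def headroom_mib() -> float:
    """Currently available memory on the device, in MiB."""
    with open("/proc/meminfo") as f:
        return available_mib(f.read())
```

Sampling `headroom_mib()` around runtime startup makes the 1-2 GB overhead directly visible.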
(b) Kernel Efficiency Issues
The SGLang community previously reported severe performance degradation in Qwen3-VL inference due to specific cuDNN version incompatibilities (see reference-link).
Notably, llama.cpp performs worse than native transformers (0.53 vs. 0.89 QPS), which is counter-intuitive given llama.cpp's typical optimizations. We suspect llama.cpp is hitting similar kernel-level inefficiencies, or unoptimized attention implementations for Qwen3-VL's visual architecture, that remain unaddressed on Jetson.
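When chasing this kind of kernel-level regression across stacks, we log the exact library versions in play on each configuration. A hedged sketch (the helper name is ours; it degrades gracefully when a package is absent):

```python
import importlib.util

def backend_versions() -> dict:
    """Report torch / cuDNN / transformers versions, or None when unavailable."""
    report = {"torch": None, "cudnn": None, "transformers": None}
    if importlib.util.find_spec("torch") is not None:
        import torch
        report["torch"] = torch.__version__
        if torch.backends.cudnn.is_available():
            report["cudnn"] = torch.backends.cudnn.version()
    if importlib.util.find_spec("transformers") is not None:
        import transformers
        report["transformers"] = transformers.__version__
    return report
```

Comparing these reports between a known-good and a regressed setup narrows the suspect to a specific cuDNN or PyTorch build.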
Request for Guidance
Could you recommend a feasible deployment strategy to achieve the 2 QPS target for this vision-language workload on Orin Nano Super? Specifically:
- Are there Jetson-optimized inference engines (TensorRT-LLM, vLLM Jetson builds, or custom CUDA graphs) validated for Qwen3-VL 2B?
- What is the recommended approach to minimize runtime memory overhead while maintaining throughput for multimodal models?
- Are there specific cuDNN/cuBLAS versions or PyTorch builds known to resolve the kernel efficiency issues for Qwen2.5-VL/Qwen3-VL architectures on SM87?