Hi
We tested this on Thor with JetPack 7.1.
The model works without any issue. Please try it again:
$ sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit
latest-jetson-thor: Pulling from nvidia-ai-iot/vllm
Digest: sha256:b587dd56b4cb076209ad5156a626ac75f5a976d0e8e7d1e6a9fccd56d1bd65e8
Status: Image is up to date for ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor
/opt/venv/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299]
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] █▄█▀ █ █ █ █ model cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:299]
(APIServer pid=1) INFO 04-10 03:18:04 [utils.py:233] non-default args: {'model_tag': 'cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit', 'model': 'cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit'}
config.json: 7.53kB [00:00, 33.2MB/s]
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 753/753 [00:00<00:00, 10.4MB/s]
(APIServer pid=1) INFO 04-10 03:18:14 [model.py:549] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=1) INFO 04-10 03:18:14 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 04-10 03:18:15 [vllm.py:790] Asynchronous scheduling is enabled.
tokenizer_config.json: 5.45kB [00:00, 29.7MB/s]
vocab.json: 2.78MB [00:00, 29.0MB/s]
merges.txt: 1.67MB [00:00, 152MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 75.7MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 707/707 [00:00<00:00, 8.80MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 6.66MB/s]
chat_template.jinja: 5.29kB [00:00, 30.3MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 218/218 [00:00<00:00, 2.19MB/s]
video_preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 861/861 [00:00<00:00, 8.77MB/s]
/opt/venv/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
(EngineCore pid=101) INFO 04-10 03:18:46 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=101) INFO 04-10 03:18:50 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.173.99.143:54629 backend=nccl
(EngineCore pid=101) INFO 04-10 03:18:50 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=101) INFO 04-10 03:19:09 [gpu_model_runner.py:4735] Starting to load model cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit...
(EngineCore pid=101) INFO 04-10 03:19:10 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=101) INFO 04-10 03:19:10 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=101) INFO 04-10 03:19:10 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=101) INFO 04-10 03:19:10 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
(EngineCore pid=101) INFO 04-10 03:19:11 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=101) INFO 04-10 03:19:11 [flash_attn.py:596] Using FlashAttention version 2
model.safetensors.index.json: 121kB [00:00, 313MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 2.55G/2.55G [01:49<00:00, 23.3MB/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [03:34<00:00, 23.3MB/s]
(EngineCore pid=101) INFO 04-10 03:22:49 [weight_utils.py:581] Time spent downloading weights for cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit: 215.894969 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.46s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.66s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:03<00:00, 1.78s/it]
(EngineCore pid=101)
(EngineCore pid=101) INFO 04-10 03:22:53 [default_loader.py:384] Loading weights took 3.70 seconds
(EngineCore pid=101) INFO 04-10 03:22:54 [gpu_model_runner.py:4820] Model loading took 7.37 GiB memory and 223.565475 seconds
(EngineCore pid=101) INFO 04-10 03:22:54 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=101) INFO 04-10 03:23:12 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/37ea329476/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=101) INFO 04-10 03:23:12 [backends.py:1111] Dynamo bytecode transform time: 9.87 s
(EngineCore pid=101) INFO 04-10 03:23:22 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=101) INFO 04-10 03:23:32 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 19.36 s
(EngineCore pid=101) INFO 04-10 03:23:34 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/7869b98bd6347361d90c877c245a07f5b60add6fc9cffafb3437e0c4d90b95b6/rank_0_0/model
(EngineCore pid=101) INFO 04-10 03:23:34 [monitor.py:48] torch.compile took 32.15 s in total
(EngineCore pid=101) INFO 04-10 03:24:00 [monitor.py:76] Initial profiling/warmup run took 25.68 s
(EngineCore pid=101) INFO 04-10 03:24:07 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=101) INFO 04-10 03:24:07 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=101) INFO 04-10 03:24:09 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 0.08 GiB total
(EngineCore pid=101) INFO 04-10 03:24:09 [gpu_worker.py:436] Available KV cache memory: 95.19 GiB
(EngineCore pid=101) INFO 04-10 03:24:09 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9007 to maintain the same effective KV cache size.
(EngineCore pid=101) INFO 04-10 03:24:09 [kv_cache_utils.py:1319] GPU KV cache size: 693,168 tokens
(EngineCore pid=101) INFO 04-10 03:24:09 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 2.64x
(EngineCore pid=101) 2026-04-10 03:24:13,300 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=101) 2026-04-10 03:24:13,323 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████| 51/51 [00:05<00:00, 9.16it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:02<00:00, 13.10it/s]
(EngineCore pid=101) INFO 04-10 03:24:22 [gpu_model_runner.py:6046] Graph capturing finished in 9 secs, took 0.01 GiB
(EngineCore pid=101) INFO 04-10 03:24:22 [gpu_worker.py:597] CUDA graph pool memory: 0.01 GiB (actual), 0.08 GiB (estimated), difference: 0.07 GiB (803.9%).
(EngineCore pid=101) INFO 04-10 03:24:22 [core.py:283] init engine (profile, create kv cache, warmup model) took 88.61 seconds
(APIServer pid=1) INFO 04-10 03:24:23 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=1) WARNING 04-10 03:24:24 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 04-10 03:24:35 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 04-10 03:24:47 [base.py:231] Multi-modal warmup completed in 11.244s
(APIServer pid=1) INFO 04-10 03:24:48 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 04-10 03:24:48 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
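Once the log shows "Application startup complete", you can verify the server from another terminal. Below is a minimal sketch against the OpenAI-compatible API, assuming the default port 8000 shown in the log above; the image URL is only a placeholder, replace it with your own:

$ curl http://localhost:8000/v1/models

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "cpatonn/Qwen3-VL-8B-Instruct-AWQ-4bit",
          "messages": [
            {"role": "user", "content": [
              {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
              {"type": "text", "text": "Describe this image."}
            ]}
          ]
        }'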
Thanks.