I have been trying to run Qwen models (e.g. Qwen 3.6 35B A3B, Qwen 3.5 35B A3B, and Qwen 3.5 9B) on my Jetson Orin, but I intermittently get this error and am forced to physically power off the Jetson Orin, which is very frustrating. It also causes errors in my other Docker containers.
This is my Docker Compose file for starting Qwen 3.6 35B A3B:
volumes:
  vllm-data: {}
  vllm-cache: {}

services:
  vlm:
    init: true
    stop_signal: SIGINT
    stop_grace_period: 2m
    container_name: vlm
    image: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin
    shm_size: "2048Mb"
    # restart: unless-stopped
    environment:
      - HF_HUB_OFFLINE=1 # comment out when downloading weights for the first time, then uncomment to avoid redownloading
    command:
      [
        "vllm",
        "serve",
        "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit",
        "--gpu-memory-utilization",
        "0.60",
        "--max-model-len",
        "16384",
        "--enable-prefix-caching",
        "--limit-mm-per-prompt.image",
        "30",
        "--limit-mm-per-prompt.video",
        "0",
        "--reasoning-parser",
        "qwen3",
        "--enable-auto-tool-choice",
        "--tool-call-parser",
        "qwen3_coder",
        "--allowed-local-media-path",
        "/kvb",
      ]
    volumes:
      - vllm-data:/data
      - vllm-cache:/root/.cache/vllm
    ports:
      - "8000:8000"
However, the container hangs at the last step shown in the logs below, and I notice that jtop also fails at that point (I cannot run jtop).
These are the logs:
➜ dev dc logs -f vlm
vlm | /opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
vlm | warnings.warn(
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299] █ █ █▄ ▄█
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299] █▄█▀ █ █ █ █ model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:233] non-default args: {'model_tag': 'cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit', 'allowed_local_media_path': '/kvb', 'max_model_len': 16384, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.6, 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 30, 'video': 0}}
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [arg_utils.py:665] HF_HUB_OFFLINE is True, replace model_id [cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit] to model_path [/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3]
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
vlm | (APIServer pid=20) INFO 05-05 05:23:03 [model.py:1678] Using max model len 16384
vlm | (APIServer pid=20) WARNING 05-05 05:23:04 [config.py:441] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
vlm | (APIServer pid=20) INFO 05-05 05:23:04 [config.py:461] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
vlm | (APIServer pid=20) INFO 05-05 05:23:05 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
vlm | (APIServer pid=20) INFO 05-05 05:23:05 [config.py:312] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
vlm | (APIServer pid=20) INFO 05-05 05:23:05 [vllm.py:790] Asynchronous scheduling is enabled.
vlm | /opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
vlm | warnings.warn(
vlm | (EngineCore pid=71) INFO 05-05 05:23:33 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3', speculative_config=None, tokenizer='/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 
'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
vlm | (EngineCore pid=71) INFO 05-05 05:23:35 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.18.0.2:50469 backend=nccl
vlm | (EngineCore pid=71) INFO 05-05 05:23:35 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
vlm | (EngineCore pid=71) INFO 05-05 05:23:47 [gpu_model_runner.py:4735] Starting to load model /data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3...
vlm | (EngineCore pid=71) INFO 05-05 05:23:48 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
vlm | (EngineCore pid=71) INFO 05-05 05:23:48 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
vlm | (EngineCore pid=71) INFO 05-05 05:23:48 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
vlm | (EngineCore pid=71) INFO 05-05 05:23:48 [compressed_tensors_moe.py:194] Using CompressedTensorsWNA16MarlinMoEMethod
vlm | (EngineCore pid=71) INFO 05-05 05:23:48 [compressed_tensors_moe.py:1180] Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)
vlm | (EngineCore pid=71) INFO 05-05 05:23:49 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
vlm | (EngineCore pid=71) INFO 05-05 05:23:49 [flash_attn.py:596] Using FlashAttention version 2
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:18<01:12, 18.21s/it]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:28<00:41, 13.78s/it]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:39<00:24, 12.29s/it]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:52<00:12, 12.48s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:58<00:00, 10.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:58<00:00, 11.63s/it]
vlm | (EngineCore pid=71)
vlm | (EngineCore pid=71) INFO 05-05 05:24:53 [default_loader.py:384] Loading weights took 58.35 seconds
vlm | (EngineCore pid=71) INFO 05-05 05:25:01 [gpu_model_runner.py:4820] Model loading took 22.41 GiB memory and 71.743337 seconds
vlm | (EngineCore pid=71) INFO 05-05 05:25:01 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
jetson_release
Software part of jetson-stats 4.5.2 - (c) 2026, Raffaello Bonghi
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.2.1 [L4T 36.4.7]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
- P-Number: p3701-0005
- Module: NVIDIA Jetson AGX Orin (64GB ram)
Platform:
- Distribution: Ubuntu 22.04 Jammy Jellyfish
- Release: 5.15.148-tegra
jtop:
- Version: 4.5.2
- Service: Active
Libraries:
- CUDA: 12.6.85
- cuDNN: 9.3.0
- TensorRT: 10.7.0.23
- VPI: 3.2.4.0
- Vulkan: 1.3.204
- OpenCV: 4.8.0 - with CUDA: NO
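For context, here is a rough memory budget for this setup, written as a minimal sketch. It combines the 64 GB module size above, the `--gpu-memory-utilization 0.60` setting, and the 22.41 GiB reported for model loading in the logs. The function name `memory_budget_gib` is just for illustration, and the arithmetic assumes the fraction is applied to a nominal 64 GiB; the amount vLLM actually sees on the Jetson's shared CPU/GPU memory is probably somewhat lower, so treat the numbers as approximate.

```python
def memory_budget_gib(total_gib: float, gpu_memory_utilization: float, weights_gib: float) -> dict:
    """Rough split of memory, assuming the utilization fraction applies to total_gib."""
    budget = total_gib * gpu_memory_utilization
    return {
        # Memory vLLM is allowed to use under this setting
        "vllm_budget_gib": budget,
        # What is left inside that budget after the weights (KV cache, activations, CUDA graphs, ...)
        "after_weights_gib": budget - weights_gib,
        # What remains for the OS, jtop, and the other containers
        "outside_budget_gib": total_gib - budget,
    }


if __name__ == "__main__":
    # 64 GiB is the nominal module size (assumption); 0.60 and 22.41 come from the config and logs.
    for name, value in memory_budget_gib(64.0, 0.60, 22.41).items():
        print(f"{name}: {value:.2f}")
```

By this estimate the weights fit well inside the configured budget, so I do not think this is a simple case of the model being too large, although the hang does happen right after model loading, when the encoder cache is being profiled.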
I am using vLLM v0.19.0 (from the ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin Docker image).
My Docker version:
Client: Docker Engine - Community
 Version:           29.4.0
 API version:       1.54
 Go version:        go1.26.1
 Git commit:        9d7ad9f
 Built:             Tue Apr 7 08:36:28 2026
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          29.4.0
  API version:      1.54 (minimum version 1.40)
  Go version:       go1.26.1
  Git commit:       daa0cb7
  Built:            Tue Apr 7 08:36:28 2026
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          v2.2.2
  GitCommit:        301b2dac98f15c27117da5c8af12118a041a31d9
 nvidia:
  Version:          1.3.4
  GitCommit:        v1.3.4-0-gd6d73eb8
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Docker Compose version: v5.1.2
After the container hangs, I cannot stop or remove it either. The container becomes a zombie and I have to force a shutdown of the Jetson Orin (sudo reboot does not fix the issue; the Docker container still remains a zombie):
➜ dev dc down vlm
[+] down 0/1
⠏ Container vlm Stopping [+] down 1/1 90 ✘ Container vlm Error Error while Stopping 134.1s
Error response from daemon: cannot stop container: 78033465695bea92526c82ff726a94d39eab91f20ed84882b9655a398689f763: tried to kill container, but did not receive an exit event
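In case it helps with diagnosis, the following is a minimal sketch for listing processes on the host that are in uninterruptible sleep (state `D`) or are zombies (state `Z`), by reading `/proc/<pid>/stat`. A process in state `D` cannot be killed by signals until the kernel call it is blocked in returns, which would be consistent with Docker not receiving an exit event, but that is only my assumption and I have not confirmed it is what happens here. The helper names `parse_proc_stat` and `stuck_processes` are my own.

```python
from pathlib import Path


def parse_proc_stat(stat_text: str) -> tuple:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' in the line.
    head, _, tail = stat_text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def stuck_processes(proc_root: str = "/proc") -> list:
    """List (pid, comm, state) for processes in state D or Z."""
    root = Path(proc_root)
    if not root.is_dir():
        return []
    stuck = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, comm, state = parse_proc_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            # The process exited while we were scanning, or the line was unreadable.
            continue
        if state in ("D", "Z"):
            stuck.append((pid, comm, state))
    return stuck


if __name__ == "__main__":
    for pid, comm, state in stuck_processes():
        print(f"{pid}\t{state}\t{comm}")
```

I can run this on the host the next time the container hangs and post the output, if the system is still responsive enough at that point.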