Problem running Qwen models via vLLM on Jetson Orin

I have been trying to run Qwen models (e.g. Qwen 3.6 35B A3B, Qwen 3.5 35B A3B, and Qwen 3.5 9B) on my Jetson Orin, but the container intermittently hangs during startup and I am forced to physically shut down the Jetson Orin, which is very frustrating. The hang also causes errors with the other Docker containers.

This is my docker compose for starting Qwen 3.6 35B A3B:

volumes:
  vllm-data: {}
  vllm-cache: {}

vlm:
  init: true
  stop_signal: SIGINT
  stop_grace_period: 2m
  container_name: vlm
  image: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin
  shm_size: "2048Mb"
  # restart: unless-stopped
  environment:
    - HF_HUB_OFFLINE=1 # comment if downloading weights for first time, then uncomment to avoid redownloading
  command:
    [
      "vllm",
      "serve",
      "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit",
      "--gpu-memory-utilization",
      "0.60",
      "--max-model-len",
      "16384",
      "--enable-prefix-caching",
      "--limit-mm-per-prompt.image",
      "30",
      "--limit-mm-per-prompt.video",
      "0",
      "--reasoning-parser",
      "qwen3",
      "--enable-auto-tool-choice",
      "--tool-call-parser",
      "qwen3_coder",
      "--allowed-local-media-path",
      "/kvb",
    ]

  volumes:
    - vllm-data:/data
    - vllm-cache:/root/.cache/vllm

  ports:
    - "8000:8000"


The container hangs at the last step shown in the logs below, and I notice that jtop also fails (I cannot run jtop).

These are the logs:

➜  dev dc logs -f vlm
vlm  | /opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
vlm  |   warnings.warn(
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]        █     █     █▄   ▄█
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]   █▄█▀ █     █     █     █  model   cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:299]
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [utils.py:233] non-default args: {'model_tag': 'cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'model': 'cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit', 'allowed_local_media_path': '/kvb', 'max_model_len': 16384, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.6, 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 30, 'video': 0}}
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [arg_utils.py:665] HF_HUB_OFFLINE is True, replace model_id [cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit] to model_path [/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3]
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [model.py:549] Resolved architecture: Qwen3_5MoeForConditionalGeneration
vlm  | (APIServer pid=20) INFO 05-05 05:23:03 [model.py:1678] Using max model len 16384
vlm  | (APIServer pid=20) WARNING 05-05 05:23:04 [config.py:441] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
vlm  | (APIServer pid=20) INFO 05-05 05:23:04 [config.py:461] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
vlm  | (APIServer pid=20) INFO 05-05 05:23:05 [config.py:281] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
vlm  | (APIServer pid=20) INFO 05-05 05:23:05 [config.py:312] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
vlm  | (APIServer pid=20) INFO 05-05 05:23:05 [vllm.py:790] Asynchronous scheduling is enabled.
vlm  | /opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
vlm  |   warnings.warn(
vlm  | (EngineCore pid=71) INFO 05-05 05:23:33 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3', speculative_config=None, tokenizer='/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 
'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
vlm  | (EngineCore pid=71) INFO 05-05 05:23:35 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.18.0.2:50469 backend=nccl
vlm  | (EngineCore pid=71) INFO 05-05 05:23:35 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
vlm  | (EngineCore pid=71) INFO 05-05 05:23:47 [gpu_model_runner.py:4735] Starting to load model /data/models/huggingface/models--cyankiwi--Qwen3.6-35B-A3B-AWQ-4bit/snapshots/7a1c0c26c56ee56f98bfdb77124acf5b239eabf3...
vlm  | (EngineCore pid=71) INFO 05-05 05:23:48 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
vlm  | (EngineCore pid=71) INFO 05-05 05:23:48 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
vlm  | (EngineCore pid=71) INFO 05-05 05:23:48 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
vlm  | (EngineCore pid=71) INFO 05-05 05:23:48 [compressed_tensors_moe.py:194] Using CompressedTensorsWNA16MarlinMoEMethod
vlm  | (EngineCore pid=71) INFO 05-05 05:23:48 [compressed_tensors_moe.py:1180] Using Marlin backend for WNA16 MoE (group_size=32, num_bits=4)
vlm  | (EngineCore pid=71) INFO 05-05 05:23:49 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
vlm  | (EngineCore pid=71) INFO 05-05 05:23:49 [flash_attn.py:596] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:18<01:12, 18.21s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:28<00:41, 13.78s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:39<00:24, 12.29s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:52<00:12, 12.48s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:58<00:00, 10.14s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:58<00:00, 11.63s/it]
vlm  | (EngineCore pid=71)
vlm  | (EngineCore pid=71) INFO 05-05 05:24:53 [default_loader.py:384] Loading weights took 58.35 seconds
vlm  | (EngineCore pid=71) INFO 05-05 05:25:01 [gpu_model_runner.py:4820] Model loading took 22.41 GiB memory and 71.743337 seconds
vlm  | (EngineCore pid=71) INFO 05-05 05:25:01 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.

jetson_release

Software part of jetson-stats 4.5.2 - (c) 2026, Raffaello Bonghi
Model: NVIDIA Jetson AGX Orin Developer Kit - Jetpack 6.2.1 [L4T 36.4.7]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - P-Number: p3701-0005
 - Module: NVIDIA Jetson AGX Orin (64GB ram)
Platform:
 - Distribution: Ubuntu 22.04 Jammy Jellyfish
 - Release: 5.15.148-tegra
jtop:
 - Version: 4.5.2
 - Service: Active
Libraries:
 - CUDA: 12.6.85
 - cuDNN: 9.3.0
 - TensorRT: 10.7.0.23
 - VPI: 3.2.4.0
 - Vulkan: 1.3.204
 - OpenCV: 4.8.0 - with CUDA: NO

I am using vLLM v0.19.0 (from the ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin Docker image).

My Docker version:

Client: Docker Engine - Community
 Version:           29.4.0
 API version:       1.54
 Go version:        go1.26.1
 Git commit:        9d7ad9f
 Built:             Tue Apr  7 08:36:28 2026
 OS/Arch:           linux/arm64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          29.4.0
  API version:      1.54 (minimum version 1.40)
  Go version:       go1.26.1
  Git commit:       daa0cb7
  Built:            Tue Apr  7 08:36:28 2026
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          v2.2.2
  GitCommit:        301b2dac98f15c27117da5c8af12118a041a31d9
 nvidia:
  Version:          1.3.4
  GitCommit:        v1.3.4-0-gd6d73eb8
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Docker Compose v5.1.2

After the container hangs, I cannot stop or remove it either. The container becomes a zombie and I have to force a shutdown of the Jetson Orin (sudo reboot does not fix the issue; the Docker container still remains a zombie).

➜  dev dc down vlm
[+] down 0/1
 ⠏ Container vlm Stopping
[+] down 1/1
 ✘ Container vlm Error Error while Stopping                                                                                                            134.1s
Error response from daemon: cannot stop container: 78033465695bea92526c82ff726a94d39eab91f20ed84882b9655a398689f763: tried to kill container, but did not receive an exit event
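
A container that cannot be killed ("did not receive an exit event") is often a sign that one of its processes is stuck in uninterruptible sleep (state D), for example inside a driver call, where SIGKILL has no effect until the call returns. Whether that is what happens here is only a guess. A minimal Python sketch for checking it on the host, assuming a Linux /proc filesystem; the helper names parse_stat and find_blocked are hypothetical:

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' instead of on spaces.
    """
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def find_blocked(proc_root="/proc", states=("D", "Z")):
    """Return (pid, comm, state) for processes in the given states.

    D = uninterruptible sleep, Z = zombie.
    """
    if not os.path.isdir(proc_root):
        return []
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state in states:
            found.append((pid, comm, state))
    return found


if __name__ == "__main__":
    for pid, comm, state in find_blocked():
        print(f"{pid}\t{state}\t{comm}")
```

If the vLLM engine process (or jtop) shows up in state D, that would point at a kernel or driver level stall rather than at Docker itself.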

Hi,

It looks like you followed the command that was shared in the tutorial below:

Is there any error, or is the device rebooting automatically?
If it is hanging, could you try waiting a little longer to see if there’s any difference?

Is your device an AGX Orin 64GB?
Thanks.

Yes, I followed the command from the tutorial above. The device does not reboot automatically. I have to physically reboot it by pressing the power button to shut it down and then pressing it again to power it up. No error appears; the container just hangs and eventually becomes unhealthy, but I cannot stop it. I get:

Error response from daemon: cannot stop container: 78033465695bea92526c82ff726a94d39eab91f20ed84882b9655a398689f763: tried to kill container, but did not receive an exit event

I cannot start other Docker containers either; the hang seems to break Docker entirely, and running sudo systemctl restart docker hangs the terminal.

Waiting longer makes no difference. I have waited up to 10 minutes, but the container still hangs. I can only force a shutdown and power the device up again.

Yes, the device is an AGX Orin 64GB.

Hi,

10 minutes may not be enough, since the first launch requires some compilation work.
Could you wait for ~1 hour to see if there is any difference?

Thanks.

I have tried a similar setup with Qwen 3.5 27B.

I experienced the same problem and left the machine running overnight for over 8 hours. I started the vLLM Docker container running Qwen 3.5 27B at 23:33:16 on 10 May, and at 08:47 on 11 May it was still hanging; the container was unhealthy and I could not kill it. I could not start other Docker containers either, jtop was not working, and sudo systemctl restart docker hung. I had to physically reboot the machine again.

This is my docker compose for starting Qwen 3.5 27B:

  vlm:
    init: true
    stop_signal: SIGINT
    stop_grace_period: 2m
    container_name: vlm
    network_mode: host
    image: ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin
    restart: unless-stopped
    shm_size: "2048m"
    depends_on:
      init-dirs:
        condition: service_completed_successfully
    volumes:
      - vllm_data:/data
      - vllm_cache:/root/.cache/vllm
      - /kvb:/kvb # from init-dirs

    environment:
      - HF_HUB_OFFLINE=1 # comment if downloading weights for first time, then uncomment to avoid redownloading

    healthcheck:
      test:
        - CMD
        - curl
        - -f
        - http://127.0.0.1:8000/health
      interval: 10s
      retries: 10
      start_period: 480s
    command:
      - vllm
      - serve
      - Kbenkhaled/Qwen3.5-27B-quantized.w4a16
      - --gpu-memory-utilization
      - "0.50" # weights take 17.55GB, total non KV-cache memory ~27.33GB (incl. weights)
      - --max-model-len
      - "16384"
      - --enable-prefix-caching
      - '--limit-mm-per-prompt={"image": 10, "video": 0}'
      - --reasoning-parser
      - qwen3
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
      - --allowed-local-media-path
      - /kvb
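
As a rough guide to when Docker would first report this container as unhealthy: probe failures during start_period are not counted towards retries, and after that the container is marked unhealthy once `retries` consecutive probes have failed. A small Python sketch of that arithmetic, using the values from the healthcheck above; the function name is hypothetical, and the estimate ignores how long each probe takes and its timeout:

```python
def approx_unhealthy_after_seconds(start_period_s, interval_s, retries):
    """Approximate time after container start at which Docker can first
    mark the container unhealthy.

    Failures inside the start period do not count towards `retries`;
    afterwards `retries` consecutive failures, one per interval, are needed.
    """
    return start_period_s + retries * interval_s


# Values taken from the healthcheck section of the compose file.
seconds = approx_unhealthy_after_seconds(start_period_s=480, interval_s=10, retries=10)
print(f"unhealthy after roughly {seconds} s ({seconds / 60:.1f} min)")
```

An unhealthy status therefore only says that /health had not answered by about that point; by itself it does not explain why startup is stuck.
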
➜  date
Mon May 11 08:47:31 AM +08 2026

➜ dc logs -f vlm
vlm  | /opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
vlm  |   warnings.warn(
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:299]
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:299]        █     █     █▄   ▄█
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:299]   █▄█▀ █     █     █     █  model   Kbenkhaled/Qwen3.5-27B-quantized.w4a16
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:299]
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [utils.py:233] non-default args: {'model_tag': 'Kbenkhaled/Qwen3.5-27B-quantized.w4a16', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'disable_access_log_for_endpoints': '/health', 'model': 'Kbenkhaled/Qwen3.5-27B-quantized.w4a16', 'allowed_local_media_path': '/kvb', 'max_model_len': 16384, 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.5, 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 10, 'video': 0}}
vlm  | (APIServer pid=20) INFO 05-10 15:31:45 [arg_utils.py:665] HF_HUB_OFFLINE is True, replace model_id [Kbenkhaled/Qwen3.5-27B-quantized.w4a16] to model_path [/data/models/huggingface/models--Kbenkhaled--Qwen3.5-27B-quantized.w4a16/snapshots/0ae8cdec5630daabdcb3690d2218f3137ce5b507]
vlm  | (APIServer pid=20) INFO 05-10 15:31:46 [model.py:549] Resolved architecture: Qwen3_5ForConditionalGeneration
vlm  | (APIServer pid=20) INFO 05-10 15:31:46 [model.py:1678] Using max model len 16384
vlm  | (APIServer pid=20) WARNING 05-10 15:31:46 [config.py:441] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
vlm  | (APIServer pid=20) INFO 05-10 15:31:46 [config.py:461] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
vlm  | (APIServer pid=20) INFO 05-10 15:31:48 [config.py:281] Setting attention block size to 784 tokens to ensure that attention page size is >= mamba page size.
vlm  | (APIServer pid=20) INFO 05-10 15:31:48 [config.py:312] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal.
vlm  | (APIServer pid=20) INFO 05-10 15:31:48 [vllm.py:790] Asynchronous scheduling is enabled.
vlm  | /opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
vlm  |   warnings.warn(
vlm  | (EngineCore pid=119) INFO 05-10 15:32:19 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='/data/models/huggingface/models--Kbenkhaled--Qwen3.5-27B-quantized.w4a16/snapshots/0ae8cdec5630daabdcb3690d2218f3137ce5b507', speculative_config=None, tokenizer='/data/models/huggingface/models--Kbenkhaled--Qwen3.5-27B-quantized.w4a16/snapshots/0ae8cdec5630daabdcb3690d2218f3137ce5b507', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/huggingface/models--Kbenkhaled--Qwen3.5-27B-quantized.w4a16/snapshots/0ae8cdec5630daabdcb3690d2218f3137ce5b507, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 
'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
vlm  | (EngineCore pid=119) INFO 05-10 15:32:22 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.10.51:58641 backend=nccl
vlm  | [rank0]:[W510 15:32:22.665739993 ProcessGroupGloo.cpp:511] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
vlm  | (EngineCore pid=119) INFO 05-10 15:32:23 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
vlm  | (EngineCore pid=119) INFO 05-10 15:32:37 [gpu_model_runner.py:4735] Starting to load model /data/models/huggingface/models--Kbenkhaled--Qwen3.5-27B-quantized.w4a16/snapshots/0ae8cdec5630daabdcb3690d2218f3137ce5b507...
vlm  | (EngineCore pid=119) INFO 05-10 15:32:38 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
vlm  | (EngineCore pid=119) INFO 05-10 15:32:38 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
vlm  | (EngineCore pid=119) INFO 05-10 15:32:39 [compressed_tensors_wNa16.py:112] Using MarlinLinearKernel for CompressedTensorsWNA16
vlm  | (EngineCore pid=119) INFO 05-10 15:32:39 [gdn_linear_attn.py:147] Using Triton/FLA GDN prefill kernel
vlm  | (EngineCore pid=119) INFO 05-10 15:32:39 [cuda.py:334] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
vlm  | (EngineCore pid=119) INFO 05-10 15:32:39 [flash_attn.py:596] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:28<00:00, 28.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:28<00:00, 28.45s/it]
vlm  | (EngineCore pid=119)
vlm  | (EngineCore pid=119) INFO 05-10 15:33:12 [default_loader.py:384] Loading weights took 28.86 seconds
vlm  | (EngineCore pid=119) INFO 05-10 15:33:16 [gpu_model_runner.py:4820] Model loading took 17.55 GiB memory and 36.553391 seconds
vlm  | (EngineCore pid=119) INFO 05-10 15:33:16 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
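
To put the memory figures from the two runs in context: vLLM's --gpu-memory-utilization is the fraction of GPU memory the engine may use in total, for weights, activations, and KV cache together. A minimal Python sketch of the resulting budget, assuming the full nominal 64 GiB of the AGX Orin's unified memory counts as GPU memory (the amount actually reported on the device may be lower, so treat the results as upper bounds) and treating the "GB" figure from the compose comment as GiB; the function name is hypothetical:

```python
def memory_budget_gib(total_gib, utilization, used_gib):
    """Return (budget, remainder) in GiB.

    budget    = total_gib * utilization   (what vLLM may use in total)
    remainder = budget - used_gib         (left after the given usage)
    """
    budget = total_gib * utilization
    return budget, budget - used_gib


TOTAL_GIB = 64.0  # assumption: nominal module memory, not a measured value

runs = {
    # name: (--gpu-memory-utilization, GiB already accounted for)
    "Qwen3.6-35B-A3B, weights only": (0.60, 22.41),
    "Qwen3.5-27B, weights only": (0.50, 17.55),
    "Qwen3.5-27B, all non-KV-cache memory": (0.50, 27.33),
}

for name, (util, used) in runs.items():
    budget, rest = memory_budget_gib(TOTAL_GIB, util, used)
    print(f"{name}: budget {budget:.2f} GiB, remaining {rest:.2f} GiB")
```

Under that assumption the 27B configuration leaves less than 5 GiB for the KV cache. This only describes vLLM's own budget; on a Jetson the CPU, other containers, and the GPU all draw from the same unified memory.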

jtop does not work either, i.e. it hangs:

➜  jtop
^C%