Gemma4-31B engine core failing

I am following the Gemma 4 31B procedure from Jetson AI Lab on an NVIDIA Jetson Thor.

Engine core initialisation is failing.

(LLM) olpeleri@olpeleri:~/LLM_playground$ docker run -it --rm --pull always --runtime=nvidia --network host -v $HOME/.cache/huggingface:/root/.cache/huggingface ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --gpu-memory-utilization 0.8 --enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4
gemma4-jetson-thor: Pulling from nvidia-ai-iot/vllm
Digest: sha256:570f9a5ffa89a772226abcc98c2d358a56ec3f755c97bc079c7f2396ffe62260
Status: Image is up to date for ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:299]
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:299] █▄█▀ █ █ █ █ model nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:299]
(APIServer pid=1) INFO 05-06 03:04:10 [utils.py:233] non-default args: {‘model_tag’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘enable_auto_tool_choice’: True, ‘tool_call_parser’: ‘gemma4’, ‘model’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘reasoning_parser’: ‘gemma4’, ‘gpu_memory_utilization’: 0.8}
(APIServer pid=1) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 9.46kB [00:00, 11.4MB/s]
processor_config.json: 1.69kB [00:00, 3.49MB/s]
(APIServer pid=1) INFO 05-06 03:04:23 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1) INFO 05-06 03:04:23 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 05-06 03:04:23 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 05-06 03:04:23 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1) WARNING 05-06 03:04:23 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 05-06 03:04:23 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-06 03:04:23 [compilation.py:290] Enabled custom fusions: act_quant
tokenizer_config.json: 2.09kB [00:00, 5.90MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:02<00:00, 11.4MB/s]
chat_template.jinja: 16.9kB [00:00, 11.8MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 903kB/s]
(EngineCore pid=122) INFO 05-06 03:04:42 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model=‘nvidia/Gemma-4-31B-IT-NVFP4’, speculative_config=None, tokenizer=‘nvidia/Gemma-4-31B-IT-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘gemma4’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘splitting_ops’: [‘vllm::unified_attention’, ‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_images_per_batch’: 0, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: True, ‘static_all_moe_layers’: }
(EngineCore pid=122) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=122) INFO 05-06 03:04:45 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.48.66.216:42697 backend=nccl
(EngineCore pid=122) INFO 05-06 03:04:45 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] EngineCore failed to start.
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1082, in run_engine_core
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 848, in __init__
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] super().__init__(
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 114, in __init__
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py”, line 103, in __init__
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] self._init_executor()
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py”, line 47, in _init_executor
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] self.driver_worker.init_device()
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py”, line 312, in init_device
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] self.worker.init_device() # type: ignore
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py”, line 283, in init_device
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py”, line 413, in request_memory
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] raise ValueError(
(EngineCore pid=122) ERROR 05-06 03:04:46 [core.py:1108] ValueError: Free memory on device cuda:0 (80.29/122.82 GiB) on startup is less than desired GPU memory utilization (0.8, 98.26 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(EngineCore pid=122) Process EngineCore:
(EngineCore pid=122) Traceback (most recent call last):
(EngineCore pid=122) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore pid=122) self.run()
(EngineCore pid=122) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore pid=122) self._target(*self._args, **self._kwargs)
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1112, in run_engine_core
(EngineCore pid=122) raise e
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1082, in run_engine_core
(EngineCore pid=122) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=122) return func(*args, **kwargs)
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 848, in __init__
(EngineCore pid=122) super().__init__(
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 114, in __init__
(EngineCore pid=122) self.model_executor = executor_class(vllm_config)
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=122) return func(*args, **kwargs)
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py”, line 103, in __init__
(EngineCore pid=122) self._init_executor()
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py”, line 47, in _init_executor
(EngineCore pid=122) self.driver_worker.init_device()
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py”, line 312, in init_device
(EngineCore pid=122) self.worker.init_device() # type: ignore
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=122) return func(*args, **kwargs)
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py”, line 283, in init_device
(EngineCore pid=122) self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=122) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=122) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py”, line 413, in request_memory
(EngineCore pid=122) raise ValueError(
(EngineCore pid=122) ValueError: Free memory on device cuda:0 (80.29/122.82 GiB) on startup is less than desired GPU memory utilization (0.8, 98.26 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
[rank0]:[W506 03:04:46.711161802 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Redirecting… (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File “/opt/venv/bin/vllm”, line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py”, line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py”, line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py”, line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py”, line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py”, line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py”, line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Any pointers to a solution?

Bonus question: has anyone tried Mistral 4 Small?

Hi,

Based on the log above, the device does not have enough free memory at startup: vLLM requests 98.26 GiB (0.8 of the 122.82 GiB total), but only 80.29 GiB is free.
Please try dropping the page cache to free memory before running the command:

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Since the command reserves 80% of GPU memory, please make sure no other GPU workloads are running at the same time.
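
To make the numbers in the ValueError concrete, here is a small Python sketch of the arithmetic. It only uses the three figures printed in the log (122.82 GiB total, 80.29 GiB free, utilization 0.8); the function names are made up for illustration and are not part of vLLM. It assumes the requested amount is simply total memory times the utilization fraction, which matches the 98.26 GiB shown in the error.

```python
# Figures copied from the ValueError in the log above (all in GiB).
TOTAL_GIB = 122.82
FREE_GIB = 80.29
REQUESTED_UTILIZATION = 0.8


def requested_gib(total_gib: float, utilization: float) -> float:
    """Memory requested at startup: a fraction of the device's total memory."""
    return total_gib * utilization


def max_fitting_utilization(total_gib: float, free_gib: float) -> float:
    """Largest utilization fraction that still fits in the currently free memory."""
    return free_gib / total_gib


if __name__ == "__main__":
    need = requested_gib(TOTAL_GIB, REQUESTED_UTILIZATION)
    ceiling = max_fitting_utilization(TOTAL_GIB, FREE_GIB)
    print(f"requested: {need:.2f} GiB, free: {FREE_GIB:.2f} GiB")
    print(f"shortfall: {need - FREE_GIB:.2f} GiB")
    print(f"largest utilization that fits: {ceiling:.3f}")
```

So either roughly 18 GiB more has to be freed, or --gpu-memory-utilization has to drop to about 0.65 or lower, as the error message itself suggests.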

Thanks.

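My bad, I pasted the wrong traceback. After dropping the caches, startup gets past the memory check and fails on the KV cache scale instead (assert layer.k_scale > 0.0 in kv_cache.py), which is surprising since I’m using the stock container for Jetson with the command documented on the page.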

(LLM) olpeleri@olpeleri:~/LLM_playground$ sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
[sudo] password for olpeleri:
(LLM) olpeleri@olpeleri:~/LLM_playground$ docker run -it --rm --pull always --runtime=nvidia --network host -v $HOME/.cache/huggingface:/root/.cache/huggingface ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --gpu-memory-utilization 0.8 --enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4
gemma4-jetson-thor: Pulling from nvidia-ai-iot/vllm
Digest: sha256:570f9a5ffa89a772226abcc98c2d358a56ec3f755c97bc079c7f2396ffe62260
Status: Image is up to date for ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299]
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] █▄█▀ █ █ █ █ model nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299]
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:233] non-default args: {‘model_tag’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘enable_auto_tool_choice’: True, ‘tool_call_parser’: ‘gemma4’, ‘model’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘reasoning_parser’: ‘gemma4’, ‘gpu_memory_utilization’: 0.8}
(APIServer pid=1) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 9.46kB [00:00, 15.7MB/s]
processor_config.json: 1.69kB [00:00, 5.09MB/s]
(APIServer pid=1) INFO 05-06 07:55:15 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1) INFO 05-06 07:55:15 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 05-06 07:55:15 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 05-06 07:55:15 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1) WARNING 05-06 07:55:15 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 05-06 07:55:15 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-06 07:55:15 [compilation.py:290] Enabled custom fusions: act_quant
tokenizer_config.json: 2.09kB [00:00, 6.87MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 25.8MB/s]
chat_template.jinja: 16.9kB [00:00, 32.1MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.81MB/s]
(EngineCore pid=123) INFO 05-06 07:55:28 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model=‘nvidia/Gemma-4-31B-IT-NVFP4’, speculative_config=None, tokenizer=‘nvidia/Gemma-4-31B-IT-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘gemma4’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘splitting_ops’: [‘vllm::unified_attention’, ‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_images_per_batch’: 0, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: True, ‘static_all_moe_layers’: }
(EngineCore pid=123) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=123) INFO 05-06 07:55:31 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.48.66.216:50511 backend=nccl
(EngineCore pid=123) INFO 05-06 07:55:31 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=123) INFO 05-06 07:55:32 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4…
(EngineCore pid=123) INFO 05-06 07:55:32 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=123) INFO 05-06 07:55:32 [compilation.py:290] Enabled custom fusions: act_quant
(EngineCore pid=123) INFO 05-06 07:55:33 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=123) INFO 05-06 07:55:33 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=123) INFO 05-06 07:55:33 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 175kB [00:00, 149MB/s]
(EngineCore pid=123) INFO 05-06 08:01:11 [weight_utils.py:581] Time spent downloading weights for nvidia/Gemma-4-31B-IT-NVFP4: 335.373990 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:10, 3.34s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.32s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:06<00:02, 2.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.51s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.84s/it]
(EngineCore pid=123)
(EngineCore pid=123) INFO 05-06 08:01:19 [default_loader.py:384] Loading weights took 7.44 seconds
(EngineCore pid=123) WARNING 05-06 08:01:19 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore pid=123) WARNING 05-06 08:01:19 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] EngineCore failed to start.
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1082, in run_engine_core
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 848, in __init__
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] super().__init__(
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 114, in __init__
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py”, line 103, in __init__
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self._init_executor()
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py”, line 52, in _init_executor
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.driver_worker.load_model()
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py”, line 323, in load_model
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 4751, in load_model
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.model = model_loader.load_model(
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py”, line 81, in load_model
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py”, line 107, in process_weights_after_loading
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] quant_method.process_weights_after_loading(module)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kv_cache.py”, line 80, in process_weights_after_loading
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] assert layer.k_scale > 0.0
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] AssertionError
(EngineCore pid=123) Process EngineCore:
(EngineCore pid=123) Traceback (most recent call last):
(EngineCore pid=123) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore pid=123) self.run()
(EngineCore pid=123) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore pid=123) self._target(*self._args, **self._kwargs)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1112, in run_engine_core
(EngineCore pid=123) raise e
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1082, in run_engine_core
(EngineCore pid=123) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 848, in __init__
(EngineCore pid=123) super().__init__(
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 114, in __init__
(EngineCore pid=123) self.model_executor = executor_class(vllm_config)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py”, line 103, in __init__
(EngineCore pid=123) self._init_executor()
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py”, line 52, in _init_executor
(EngineCore pid=123) self.driver_worker.load_model()
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py”, line 323, in load_model
(EngineCore pid=123) self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 4751, in load_model
(EngineCore pid=123) self.model = model_loader.load_model(
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py”, line 81, in load_model
(EngineCore pid=123) process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py”, line 107, in process_weights_after_loading
(EngineCore pid=123) quant_method.process_weights_after_loading(module)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kv_cache.py”, line 80, in process_weights_after_loading
(EngineCore pid=123) assert layer.k_scale > 0.0
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) AssertionError
[rank0]:[W506 08:01:20.368106112 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Redirecting… (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File “/opt/venv/bin/vllm”, line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py”, line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py”, line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py”, line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py”, line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py”, line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py”, line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

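It fails in the KV cache setup: the assertion layer.k_scale > 0.0 in kv_cache.py (line 80) raises an AssertionError.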
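
One way to follow up on the failing k_scale assertion is to check which KV scale tensors the downloaded checkpoint actually lists. The sketch below is only an illustration and is not part of vLLM: it reads the `weight_map` table of a `model.safetensors.index.json` file and collects the tensor names ending in `k_scale` or `v_scale`. The function names and the suffix-based matching are assumptions, the real tensor names in this checkpoint may differ, and the sample names in the demo are hypothetical.

```python
import json
from pathlib import Path


def load_weight_map(index_path):
    """Return the tensor-name -> shard-file table from a safetensors index file."""
    with Path(index_path).open(encoding="utf-8") as fh:
        return json.load(fh).get("weight_map", {})


def find_kv_scale_keys(weight_map):
    """Split tensor names into those ending in k_scale and those ending in v_scale.

    Matching by suffix is an assumption about how the checkpoint names its
    KV cache scaling factors.
    """
    k_keys = sorted(name for name in weight_map if name.endswith("k_scale"))
    v_keys = sorted(name for name in weight_map if name.endswith("v_scale"))
    return k_keys, v_keys


if __name__ == "__main__":
    # Hypothetical tensor names, for illustration only.
    sample = {
        "layers.0.attn.k_scale": "model-00001-of-00004.safetensors",
        "layers.0.attn.v_scale": "model-00001-of-00004.safetensors",
        "layers.1.attn.v_scale": "model-00002-of-00004.safetensors",
        "layers.1.attn.weight": "model-00002-of-00004.safetensors",
    }
    k_keys, v_keys = find_kv_scale_keys(sample)
    print(f"k_scale entries: {len(k_keys)}, v_scale entries: {len(v_keys)}")
```

To run it against the real checkpoint, pass the path of the cached `model.safetensors.index.json` to `load_weight_map` and feed the result to `find_kv_scale_keys`. Differing counts would suggest that some layers list one scale but not the other. This only shows which names are listed, not their values.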

(LLM) olpeleri@olpeleri:~/LLM_playground$ sudo sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
[sudo] password for olpeleri:
(LLM) olpeleri@olpeleri:~/LLM_playground$ docker run -it --rm --pull always --runtime=nvidia --network host -v $HOME/.cache/huggingface:/root/.cache/huggingface ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor vllm serve nvidia/Gemma-4-31B-IT-NVFP4 --gpu-memory-utilization 0.8 --enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4
gemma4-jetson-thor: Pulling from nvidia-ai-iot/vllm
Digest: sha256:570f9a5ffa89a772226abcc98c2d358a56ec3f755c97bc079c7f2396ffe62260
Status: Image is up to date for ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299]
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] █▄█▀ █ █ █ █ model nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:299]
(APIServer pid=1) INFO 05-06 07:55:02 [utils.py:233] non-default args: {‘model_tag’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘enable_auto_tool_choice’: True, ‘tool_call_parser’: ‘gemma4’, ‘model’: ‘nvidia/Gemma-4-31B-IT-NVFP4’, ‘reasoning_parser’: ‘gemma4’, ‘gpu_memory_utilization’: 0.8}
(APIServer pid=1) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 9.46kB [00:00, 15.7MB/s]
processor_config.json: 1.69kB [00:00, 5.09MB/s]
(APIServer pid=1) INFO 05-06 07:55:15 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1) INFO 05-06 07:55:15 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 05-06 07:55:15 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 05-06 07:55:15 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1) WARNING 05-06 07:55:15 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 05-06 07:55:15 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-06 07:55:15 [compilation.py:290] Enabled custom fusions: act_quant
tokenizer_config.json: 2.09kB [00:00, 6.87MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 25.8MB/s]
chat_template.jinja: 16.9kB [00:00, 32.1MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.81MB/s]
(EngineCore pid=123) INFO 05-06 07:55:28 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model=‘nvidia/Gemma-4-31B-IT-NVFP4’, speculative_config=None, tokenizer=‘nvidia/Gemma-4-31B-IT-NVFP4’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘gemma4’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘splitting_ops’: [‘vllm::unified_attention’, ‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::olmo_hybrid_gdn_full_forward’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, 
‘vllm::rocm_aiter_sparse_attn_indexer’, ‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘cudagraph_mm_encoder’: False, ‘encoder_cudagraph_token_budgets’: , ‘encoder_cudagraph_max_images_per_batch’: 0, ‘compile_sizes’: , ‘compile_ranges_endpoints’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘size_asserts’: False, ‘alignment_asserts’: False, ‘scalar_asserts’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: True, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: True, ‘static_all_moe_layers’: }
(EngineCore pid=123) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=123) INFO 05-06 07:55:31 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.48.66.216:50511 backend=nccl
(EngineCore pid=123) INFO 05-06 07:55:31 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=123) INFO 05-06 07:55:32 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4…
(EngineCore pid=123) INFO 05-06 07:55:32 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=123) INFO 05-06 07:55:32 [compilation.py:290] Enabled custom fusions: act_quant
(EngineCore pid=123) INFO 05-06 07:55:33 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=123) INFO 05-06 07:55:33 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=123) INFO 05-06 07:55:33 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 175kB [00:00, 149MB/s]
(EngineCore pid=123) INFO 05-06 08:01:11 [weight_utils.py:581] Time spent downloading weights for nvidia/Gemma-4-31B-IT-NVFP4: 335.373990 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:10, 3.34s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.32s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:06<00:02, 2.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.51s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:07<00:00, 1.84s/it]
(EngineCore pid=123)
(EngineCore pid=123) INFO 05-06 08:01:19 [default_loader.py:384] Loading weights took 7.44 seconds
(EngineCore pid=123) WARNING 05-06 08:01:19 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore pid=123) WARNING 05-06 08:01:19 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] EngineCore failed to start.
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1082, in run_engine_core
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 848, in __init__
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] super().__init__(
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 114, in __init__
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.model_executor = executor_class(vllm_config)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py”, line 103, in __init__
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self._init_executor()
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py”, line 52, in _init_executor
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.driver_worker.load_model()
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py”, line 323, in load_model
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 4751, in load_model
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] self.model = model_loader.load_model(
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py”, line 81, in load_model
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py”, line 107, in process_weights_after_loading
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] quant_method.process_weights_after_loading(module)
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kv_cache.py”, line 80, in process_weights_after_loading
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] assert layer.k_scale > 0.0
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) ERROR 05-06 08:01:19 [core.py:1108] AssertionError
(EngineCore pid=123) Process EngineCore:
(EngineCore pid=123) Traceback (most recent call last):
(EngineCore pid=123) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py”, line 314, in _bootstrap
(EngineCore pid=123) self.run()
(EngineCore pid=123) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/multiprocessing/process.py”, line 108, in run
(EngineCore pid=123) self._target(*self._args, **self._kwargs)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1112, in run_engine_core
(EngineCore pid=123) raise e
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 1082, in run_engine_core
(EngineCore pid=123) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 848, in __init__
(EngineCore pid=123) super().__init__(
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py”, line 114, in __init__
(EngineCore pid=123) self.model_executor = executor_class(vllm_config)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py”, line 103, in __init__
(EngineCore pid=123) self._init_executor()
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py”, line 52, in _init_executor
(EngineCore pid=123) self.driver_worker.load_model()
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py”, line 323, in load_model
(EngineCore pid=123) self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 4751, in load_model
(EngineCore pid=123) self.model = model_loader.load_model(
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(EngineCore pid=123) return func(*args, **kwargs)
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py”, line 81, in load_model
(EngineCore pid=123) process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py”, line 107, in process_weights_after_loading
(EngineCore pid=123) quant_method.process_weights_after_loading(module)
(EngineCore pid=123) File “/opt/venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kv_cache.py”, line 80, in process_weights_after_loading
(EngineCore pid=123) assert layer.k_scale > 0.0
(EngineCore pid=123) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=123) AssertionError
[rank0]:[W506 08:01:20.368106112 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see Redirecting… (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1) File “/opt/venv/bin/vllm”, line 10, in <module>
(APIServer pid=1) sys.exit(main())
(APIServer pid=1) ^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py”, line 75, in main
(APIServer pid=1) args.dispatch_function(args)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py”, line 122, in cmd
(APIServer pid=1) uvloop.run(run_server(args))
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py”, line 96, in run
(APIServer pid=1) return __asyncio.run(
(APIServer pid=1) ^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py”, line 195, in run
(APIServer pid=1) return runner.run(main)
(APIServer pid=1) ^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/asyncio/runners.py”, line 118, in run
(APIServer pid=1) return self._loop.run_until_complete(task)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/uvloop/__init__.py”, line 48, in wrapper
(APIServer pid=1) return await main
(APIServer pid=1) ^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 670, in run_server
(APIServer pid=1) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 684, in run_server_worker
(APIServer pid=1) async with build_async_engine_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 100, in build_async_engine_client
(APIServer pid=1) async with build_async_engine_client_from_engine_args(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 210, in __aenter__
(APIServer pid=1) return await anext(self.gen)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py”, line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 225, in from_vllm_config
(APIServer pid=1) return cls(
(APIServer pid=1) ^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py”, line 154, in __init__
(APIServer pid=1) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 130, in make_async_mp_client
(APIServer pid=1) return AsyncMPClient(*client_args)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/tracing/otel.py”, line 178, in sync_wrapper
(APIServer pid=1) return func(*args, **kwargs)
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 887, in __init__
(APIServer pid=1) super().__init__(
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py”, line 535, in __init__
(APIServer pid=1) with launch_core_engines(
(APIServer pid=1) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) File “/root/.local/share/uv/python/cpython-3.12.13-linux-aarch64-gnu/lib/python3.12/contextlib.py”, line 144, in __exit__
(APIServer pid=1) next(self.gen)
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 998, in launch_core_engines
(APIServer pid=1) wait_for_engine_startup(
(APIServer pid=1) File “/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py”, line 1057, in wait_for_engine_startup
(APIServer pid=1) raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Hi,

We just tried this internally, and the model deploys without issue.
Could you try it again?

$ # Serve command
sudo docker run -it --rm --pull always \
--runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor \
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
--gpu-memory-utilization 0.8 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4
gemma4-jetson-thor: Pulling from nvidia-ai-iot/vllm
ffb6db6f5af7: Pull complete 
Digest: sha256:570f9a5ffa89a772226abcc98c2d358a56ec3f755c97bc079c7f2396ffe62260
Status: Downloaded newer image for ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:299] 
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.0
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:299]   █▄█▀ █     █     █     █  model   nvidia/Gemma-4-31B-IT-NVFP4
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:299] 
(APIServer pid=1) INFO 05-07 05:25:13 [utils.py:233] non-default args: {'model_tag': 'nvidia/Gemma-4-31B-IT-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'nvidia/Gemma-4-31B-IT-NVFP4', 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.8}
(APIServer pid=1) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
config.json: 9.46kB [00:00, 17.8MB/s]
processor_config.json: 1.69kB [00:00, 5.16MB/s]
(APIServer pid=1) INFO 05-07 05:25:22 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=1) INFO 05-07 05:25:22 [model.py:1678] Using max model len 262144
(APIServer pid=1) INFO 05-07 05:25:23 [cache.py:227] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 05-07 05:25:23 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=1) WARNING 05-07 05:25:23 [modelopt.py:998] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 05-07 05:25:23 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-07 05:25:23 [compilation.py:290] Enabled custom fusions: act_quant
tokenizer_config.json: 2.09kB [00:00, 6.15MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:01<00:00, 17.5MB/s]
chat_template.jinja: 16.9kB [00:00, 33.6MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.67MB/s]
(EngineCore pid=123) INFO 05-07 05:25:38 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='nvidia/Gemma-4-31B-IT-NVFP4', speculative_config=None, tokenizer='nvidia/Gemma-4-31B-IT-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Gemma-4-31B-IT-NVFP4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=123) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(EngineCore pid=123) INFO 05-07 05:25:42 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.173.99.143:40515 backend=nccl
(EngineCore pid=123) INFO 05-07 05:25:42 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=123) INFO 05-07 05:25:43 [gpu_model_runner.py:4735] Starting to load model nvidia/Gemma-4-31B-IT-NVFP4...
(EngineCore pid=123) INFO 05-07 05:25:43 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=123) INFO 05-07 05:25:43 [compilation.py:290] Enabled custom fusions: act_quant
(EngineCore pid=123) INFO 05-07 05:25:44 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=123) INFO 05-07 05:25:44 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore pid=123) INFO 05-07 05:25:44 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 175kB [00:00, 153MB/s]
(EngineCore pid=123) INFO 05-07 05:26:50 [weight_utils.py:581] Time spent downloading weights for nvidia/Gemma-4-31B-IT-NVFP4: 63.173920 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:05<00:15,  5.08s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:10<00:10,  5.45s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:16<00:05,  5.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00,  4.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:18<00:00,  4.69s/it]
(EngineCore pid=123) 
(EngineCore pid=123) INFO 05-07 05:27:09 [default_loader.py:384] Loading weights took 18.96 seconds
(EngineCore pid=123) WARNING 05-07 05:27:09 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore pid=123) WARNING 05-07 05:27:09 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore pid=123) INFO 05-07 05:27:10 [gpu_model_runner.py:4820] Model loading took 31.04 GiB memory and 86.151217 seconds
(EngineCore pid=123) INFO 05-07 05:27:10 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 2496 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=123) INFO 05-07 05:27:52 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/76a50f446a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=123) INFO 05-07 05:27:52 [backends.py:1111] Dynamo bytecode transform time: 16.83 s
(EngineCore pid=123) [rank0]:W0507 05:27:53.876000 123 torch/_inductor/utils.py:1679] Not enough SMs to use max_autotune_gemm mode
(EngineCore pid=123) INFO 05-07 05:28:05 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=123) INFO 05-07 05:28:35 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 42.33 s
(EngineCore pid=123) INFO 05-07 05:28:39 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/ffe48dbede4899d3ac3333b640e6f1582c61b4be5d494143ed6d3420e34b4a11/rank_0_0/model
(EngineCore pid=123) INFO 05-07 05:28:39 [monitor.py:48] torch.compile took 64.26 s in total
(EngineCore pid=123) INFO 05-07 05:28:41 [monitor.py:76] Initial profiling/warmup run took 2.08 s
(EngineCore pid=123) INFO 05-07 05:28:49 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=123) INFO 05-07 05:28:49 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=123) INFO 05-07 05:28:56 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 6.42 GiB total
(EngineCore pid=123) INFO 05-07 05:28:57 [gpu_worker.py:436] Available KV cache memory: 61.01 GiB
(EngineCore pid=123) INFO 05-07 05:28:57 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.8000 to 0.8523 to maintain the same effective KV cache size.
(EngineCore pid=123) INFO 05-07 05:28:57 [kv_cache_utils.py:1319] GPU KV cache size: 133,264 tokens
(EngineCore pid=123) INFO 05-07 05:28:57 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 5.46x
(EngineCore pid=123) 2026-05-07 05:29:04,456 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=123) 2026-05-07 05:29:17,049 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:14<00:00,  3.53it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:13<00:00,  2.60it/s]
(EngineCore pid=123) INFO 05-07 05:29:46 [gpu_model_runner.py:6046] Graph capturing finished in 29 secs, took -0.05 GiB
(EngineCore pid=123) INFO 05-07 05:29:46 [gpu_worker.py:597] CUDA graph pool memory: -0.05 GiB (actual), 6.42 GiB (estimated), difference: 6.48 GiB (695395532800.0%).
(EngineCore pid=123) INFO 05-07 05:29:46 [core.py:283] init engine (profile, create kv cache, warmup model) took 156.01 seconds
(APIServer pid=1) INFO 05-07 05:29:47 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=1) INFO 05-07 05:29:47 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 05-07 05:29:47 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 05-07 05:29:57 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 05-07 05:30:20 [base.py:231] Multi-modal warmup completed in 22.693s
(APIServer pid=1) INFO 05-07 05:30:21 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 05-07 05:30:21 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

Thanks.

Thanks for testing. Unfortunately, the same command still fails on my Thor with the assertion at kv_cache.py:80. Hoping you can spot what's different about my host vs. the one you tested on.

Ran your exact command:

sudo docker run -it --rm --pull always \
--runtime=nvidia --network host \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
ghcr.io/nvidia-ai-iot/vllm:gemma4-jetson-thor \
vllm serve nvidia/Gemma-4-31B-IT-NVFP4 \
--gpu-memory-utilization 0.8 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4

Confirmed identical inputs:

- Image digest: sha256:570f9a5ffa89a772226abcc98c2d358a56ec3f755c97bc079c7f2396ffe62260 (gemma4-jetson-thor, 4 weeks old, refreshed via --pull always)
- Model: nvidia/Gemma-4-31B-IT-NVFP4 (downloaded fresh into the container’s /data/models/huggingface since that path is ephemeral on --rm)
- Cleared the host page cache before launch (sysctl -w vm.drop_caches=3)

My host:

- Jetson AGX Thor, L4T R38 (release), REVISION: 2.2, BOARD: generic (JetPack 7.x, kernel 6.8.12-tegra, Ubuntu 24.04.4)
- NVIDIA driver 580.00 / CUDA 13.0 (per nvidia-smi)
- Docker 29.1.3

Failure log (relevant section):

INFO default_loader.py:384 Loading weights took 7.34 seconds
WARNING kv_cache.py:94 Checkpoint does not provide a q scaling factor. Setting it to k_scale.
WARNING kv_cache.py:108 Using KV cache scaling factor 1.0 for fp8_e4m3.
ERROR core.py:1108 File "…/vllm/model_executor/layers/quantization/kv_cache.py", line 80
assert layer.k_scale > 0.0
ERROR core.py:1108 AssertionError
Note that the warnings at kv_cache.py:94 and kv_cache.py:108 fire as expected (some layers' scales fall back to 1.0), and then the assert at line 80 fires on a subsequent layer that has k_scale = 0 rather than None. So the loader handles a missing scale by defaulting it to 1.0, but an explicit 0 is not handled and trips the assert. Our checkpoint apparently has at least one layer in that explicit-0 state on this host, while yours does not.

Could you share:

- Your host JetPack / L4T release and nvidia-smi driver version (so I can compare with mine)?
- The HF snapshot revision of nvidia/Gemma-4-31B-IT-NVFP4 your container ended up downloading (visible in your container as /data/models/huggingface/hub/…/snapshots//)?

That would help narrow whether this is host-driver-dependent, an HF-cache race, or a checkpoint-revision drift between our two runs. Happy to attach the full log if useful.
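
For comparing snapshot revisions, a small sketch like the following lists what is cached for a repo. It assumes the standard Hugging Face hub cache layout (`hub/models--<org>--<name>/snapshots/<revision>/`); the function name and the example cache root are hypothetical and should be pointed at whichever directory the container actually uses.

```python
from pathlib import Path


def list_snapshots(cache_root, repo_id):
    """Return the snapshot revision names cached for repo_id under cache_root."""
    repo_dir = Path(cache_root) / "hub" / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return []  # nothing cached at this location
    return sorted(p.name for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumed cache root; adjust to match the -v mount or the in-container path.
    root = Path.home() / ".cache" / "huggingface"
    print(list_snapshots(root, "nvidia/Gemma-4-31B-IT-NVFP4"))
```

If both hosts print the same revision, checkpoint drift can likely be ruled out.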

Hi,

Sorry for the late update.
We tested this with JetPack 7.1 (driver 580.00).

The model is no longer cached locally on our device (for some reason it is not in ${HOME}/vllm_cache).
But we deployed the model again, and it still works.

sudo docker run -it --rm --pull always --runtime=nvidia -v ${HOME}/vllm_cache:/root/.cache --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --max-num-batched-tokens 8192 --host 0.0.0.0 --port 8000 --trust-remote-code
...
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

Have you tried deleting the cache and re-downloading the model?
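
A minimal sketch of clearing one model from the cache, assuming the standard Hugging Face hub layout under whichever host directory is mounted into the container. The function name and the `dry_run` flag are hypothetical. Files written by the container may be owned by root, so the removal may need elevated permissions.

```python
import shutil
from pathlib import Path


def remove_cached_model(cache_root, repo_id, dry_run=True):
    """Delete the cached files for repo_id; returns the path, or None if absent.

    With dry_run=True (the default) nothing is deleted, so the path can be
    checked first.
    """
    repo_dir = Path(cache_root) / "hub" / ("models--" + repo_id.replace("/", "--"))
    if not repo_dir.is_dir():
        return None
    if not dry_run:
        shutil.rmtree(repo_dir)
    return repo_dir
```

For the serve command earlier in this thread the cache root would be $HOME/.cache/huggingface on the host; the next `vllm serve` run then downloads the model again.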
Thanks.