vLLM container 25.10-py3 fails to start

I’m running the Hermes model in the vLLM container. This used to work in version 25.09-py3 but fails in version 25.10-py3.

The Docker Compose service definition is:

vllm:
  deploy:
    resources:
      limits:
        memory: 100G
  image: nvcr.io/nvidia/vllm:25.10-py3
  ports:
    - 8000:8000
  privileged: true
  gpus: all
  shm_size: 8gb
  ipc: host
  ulimits:
    memlock: 1
    stack: 67108864
  env_file:
    - .env
  volumes:
    - $HOME/.cache:/root/.cache
    - $PWD:/workspace
  entrypoint:
    [
      "vllm", "serve",
      "--gpu-memory-utilization=0.7",
      "--max_model_len=5000",
      "--host=0.0.0.0", "--port=8000",
      "NousResearch/Hermes-4-70B-FP8"
    ]
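
For reference, this should be roughly equivalent to the following plain docker run invocation (a sketch I derived from the compose keys above, not a command I actually ran):

docker run --rm -it \
  --memory 100g \
  --gpus all \
  --privileged \
  --ipc host \
  --shm-size 8g \
  --ulimit memlock=1 \
  --ulimit stack=67108864 \
  --env-file .env \
  -p 8000:8000 \
  -v $HOME/.cache:/root/.cache \
  -v $PWD:/workspace \
  --entrypoint vllm \
  nvcr.io/nvidia/vllm:25.10-py3 \
  serve --gpu-memory-utilization=0.7 --max_model_len=5000 \
  --host=0.0.0.0 --port=8000 NousResearch/Hermes-4-70B-FP8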

Some messages in the startup log:
Repeatedly: Trying to use TMA Descriptor Prefetch without CUTE_ARCH_TMA_SM90_ENABLED.
And then torch.AcceleratorError: CUDA error: unspecified launch failure

See the attached file for the full log of the container startup.

vllm-exception.txt (50.3 KB)


Some extra information:
I did delete /root/.cache/vllm/torch_compile_cache/ but the problem persists.

When I change the version back to 25.09-py3, it starts immediately.

Hi,

We can launch nvcr.io/nvidia/vllm:25.10-py3 without any error.
Could you clear $HOME/.cache and try it again?
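
For example, something like this on the host should remove both the vLLM compile cache and the downloaded model weights (paths based on the volume mount in your compose file; an illustration only):

# removes the vLLM torch.compile cache and the Hugging Face model cache on the host
rm -rf $HOME/.cache/vllm
rm -rf $HOME/.cache/huggingface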

$ sudo docker run -it --rm nvcr.io/nvidia/vllm:25.10-py3

==========
== vLLM ==
==========

NVIDIA Release 25.10 (build 224204848)
vLLM Version 0.10.2+9dd9ca32
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for vLLM.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@f1bd4f0d4cf7:/workspace# python3 -m vllm.entrypoints.openai.api_server --model nvidia/Llama-3.1-8B-Instruct-FP8 --trust-remote-code --tensor-parallel-size 1 --max-model-len 1024 --gpu-memory-utilization 0.85
INFO 11-03 03:39:59 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=157) INFO 11-03 03:40:00 [api_server.py:1919] vLLM API server version 0.10.2+9dd9ca32.nv25.10
(APIServer pid=157) INFO 11-03 03:40:00 [utils.py:328] non-default args: {'model': 'nvidia/Llama-3.1-8B-Instruct-FP8', 'trust_remote_code': True, 'max_model_len': 1024, 'gpu_memory_utilization': 0.85}
(APIServer pid=157) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 879/879 [00:00<00:00, 3.72MB/s]
hf_quant_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 241/241 [00:00<00:00, 1.72MB/s]
(APIServer pid=157) INFO 11-03 03:40:10 [__init__.py:742] Resolved architecture: LlamaForCausalLM
(APIServer pid=157) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=157) INFO 11-03 03:40:10 [__init__.py:1815] Using max model len 1024
(APIServer pid=157) WARNING 11-03 03:40:11 [_ipex_ops.py:16] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=157) INFO 11-03 03:40:14 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=157) WARNING 11-03 03:40:14 [modelopt.py:71] Detected ModelOpt fp8 checkpoint. Please note that the format is experimental and could change.
tokenizer_config.json: 50.9kB [00:00, 44.8MB/s]
tokenizer.json: 9.09MB [00:00, 16.2MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 325/325 [00:00<00:00, 1.57MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 189/189 [00:00<00:00, 913kB/s]
INFO 11-03 03:40:25 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=226) INFO 11-03 03:40:27 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=226) INFO 11-03 03:40:27 [core.py:76] Initializing a V1 LLM engine (v0.10.2+9dd9ca32.nv25.10) with config: model='nvidia/Llama-3.1-8B-Instruct-FP8', speculative_config=None, tokenizer='nvidia/Llama-3.1-8B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=nvidia/Llama-3.1-8B-Instruct-FP8, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[W1103 03:40:45.697080431 ProcessGroupNCCL.cpp:936] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=226) INFO 11-03 03:40:45 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=226) INFO 11-03 03:40:45 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=226) INFO 11-03 03:40:45 [gpu_model_runner.py:2338] Starting to load model nvidia/Llama-3.1-8B-Instruct-FP8...
(EngineCore_DP0 pid=226) INFO 11-03 03:40:45 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=226) INFO 11-03 03:40:46 [cuda.py:362] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=226) INFO 11-03 03:40:47 [weight_utils.py:348] Using model weights format ['*.safetensors']
model-00001-of-00002.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████| 5.00G/5.00G [01:43<00:00, 48.5MB/s]
model-00002-of-00002.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████| 4.08G/4.08G [01:44<00:00, 39.0MB/s]
(EngineCore_DP0 pid=226) INFO 11-03 03:42:33 [weight_utils.py:369] Time spent downloading weights for nvidia/Llama-3.1-8B-Instruct-FP8: 106.525137 seconds
model.safetensors.index.json: 68.1kB [00:00, 91.1MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.19it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.16it/s]
(EngineCore_DP0 pid=226) 
(EngineCore_DP0 pid=226) INFO 11-03 03:42:36 [default_loader.py:268] Loading weights took 1.87 seconds
(EngineCore_DP0 pid=226) INFO 11-03 03:42:37 [gpu_model_runner.py:2392] Model loading took 8.4890 GiB and 110.664644 seconds
(EngineCore_DP0 pid=226) INFO 11-03 03:42:43 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/d77cdd90d8/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=226) INFO 11-03 03:42:43 [backends.py:550] Dynamo bytecode transform time: 6.49 s
(EngineCore_DP0 pid=226) INFO 11-03 03:42:50 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=226) INFO 11-03 03:43:01 [backends.py:215] Compiling a graph for dynamic shape takes 17.14 s
(EngineCore_DP0 pid=226) INFO 11-03 03:44:41 [monitor.py:34] torch.compile takes 23.63 s in total
(EngineCore_DP0 pid=226) INFO 11-03 03:45:39 [gpu_worker.py:298] Available KV cache memory: 92.80 GiB
(EngineCore_DP0 pid=226) INFO 11-03 03:45:40 [kv_cache_utils.py:864] GPU KV cache size: 760,176 tokens
(EngineCore_DP0 pid=226) INFO 11-03 03:45:40 [kv_cache_utils.py:868] Maximum concurrency for 1,024 tokens per request: 742.36x
(EngineCore_DP0 pid=226) 2025-11-03 03:45:46,113 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=226) 2025-11-03 03:46:02,546 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████| 67/67 [00:05<00:00, 12.31it/s]
(EngineCore_DP0 pid=226) INFO 11-03 03:46:08 [gpu_model_runner.py:3118] Graph capturing finished in 6 secs, took 0.75 GiB
(EngineCore_DP0 pid=226) INFO 11-03 03:46:08 [gpu_worker.py:391] Free memory on device (116.73/122.82 GiB) on startup. Desired GPU memory utilization is (0.85, 104.4 GiB). Actual usage is 8.49 GiB for weight, 1.02 GiB for peak activation, 2.1 GiB for non-torch memory, and 0.75 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=98679709388` to fit into requested memory, or `--kv-cache-memory=111919078912` to fully utilize gpu memory. Current kv cache memory in use is 99638353612 bytes.
(EngineCore_DP0 pid=226) INFO 11-03 03:46:09 [core.py:218] init engine (profile, create kv cache, warmup model) took 211.92 seconds
(APIServer pid=157) INFO 11-03 03:46:10 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 47511
(APIServer pid=157) INFO 11-03 03:46:10 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=157) INFO 11-03 03:46:11 [api_server.py:1715] Supported_tasks: ['generate']
(APIServer pid=157) WARNING 11-03 03:46:11 [__init__.py:1695] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=157) INFO 11-03 03:46:11 [serving_responses.py:130] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
(APIServer pid=157) INFO 11-03 03:46:12 [serving_chat.py:137] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
(APIServer pid=157) INFO 11-03 03:46:12 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
(APIServer pid=157) INFO 11-03 03:46:12 [api_server.py:1994] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:36] Available routes are:
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=157) INFO 11-03 03:46:12 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=157) INFO:     Started server process [157]
(APIServer pid=157) INFO:     Waiting for application startup.
(APIServer pid=157) INFO:     Application startup complete.

Thanks.

@AastaLLL I notice that you tested with a different model. Maybe the issue is specific to that (type of) model? I don’t see the problem with other models myself.
Can you test with the Hermes model?

I did try again with a completely empty $HOME/.cache.
Of course it took a long time to download the model again.
Btw, I was surprised that there was no download progress indication after the line Using model weights format ['*.safetensors'], even though I noticed the network traffic and saw the Hugging Face cache directory being updated.

It looks like the same error. I’ll upload the log file today when I get to the Jetson Thor.

Yes, same error with a new $HOME/.cache directory. See the attachment.

When I switch back to 25.09-py3 it starts without errors, so I do think there is an issue.
Can you test with that same Hermes model?

vllm-exception-2.txt (50.6 KB)

Hi,

Yes, we are testing the model internally.
We will update you on the status from our side later.

Thanks.

Hi,

We didn’t encounter the issue with NousResearch/Hermes-4-70B-FP8.
Is it possible that the issue is related to the API entrypoint you use to launch the server?
Would you mind giving our command a try?

root@f0d13212f3c1:/workspace# python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 6678 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32k \
  --cuda-graph-sizes 4 \
  --max_num_seqs 4 \
  --served-model-name NousResearch/Hermes-4-70B-FP8
INFO 11-06 09:51:01 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=157) INFO 11-06 09:51:02 [api_server.py:1919] vLLM API server version 0.10.2+9dd9ca32.nv25.10
(APIServer pid=157) INFO 11-06 09:51:02 [utils.py:328] non-default args: {'host': '0.0.0.0', 'port': 6678, 'max_model_len': 32000, 'served_model_name': ['NousResearch/Hermes-4-70B-FP8'], 'gpu_memory_utilization': 0.85, 'max_num_seqs': 4, 'cuda_graph_sizes': [4]}
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 726/726 [00:00<00:00, 6.20MB/s]
(APIServer pid=157) INFO 11-06 09:51:10 [__init__.py:742] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=157) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=157) INFO 11-06 09:51:10 [__init__.py:1815] Using max model len 32000
(APIServer pid=157) INFO 11-06 09:51:12 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer_config.json: 9.73kB [00:00, 20.8MB/s]
vocab.json: 2.78MB [00:00, 7.99MB/s]
merges.txt: 1.67MB [00:00, 25.8MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 6.71MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 1.68MB/s]
INFO 11-06 09:51:26 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=244) INFO 11-06 09:51:27 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=244) INFO 11-06 09:51:27 [core.py:76] Initializing a V1 LLM engine (v0.10.2+9dd9ca32.nv25.10) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=NousResearch/Hermes-4-70B-FP8, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":4,"local_cache_dir":null}
[W1106 09:51:51.364940700 ProcessGroupNCCL.cpp:936] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=244) INFO 11-06 09:51:51 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=244) INFO 11-06 09:51:51 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=244) INFO 11-06 09:51:51 [gpu_model_runner.py:2338] Starting to load model Qwen/Qwen3-0.6B...
(EngineCore_DP0 pid=244) INFO 11-06 09:51:51 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=244) INFO 11-06 09:51:51 [cuda.py:362] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=244) INFO 11-06 09:51:52 [weight_utils.py:348] Using model weights format ['*.safetensors']
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1.50G/1.50G [00:16<00:00, 91.4MB/s]
(EngineCore_DP0 pid=244) INFO 11-06 09:52:09 [weight_utils.py:369] Time spent downloading weights for Qwen/Qwen3-0.6B: 17.446960 seconds
(EngineCore_DP0 pid=244) INFO 11-06 09:52:10 [weight_utils.py:406] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.20s/it]
(EngineCore_DP0 pid=244) 
(EngineCore_DP0 pid=244) INFO 11-06 09:52:12 [default_loader.py:268] Loading weights took 2.23 seconds
(EngineCore_DP0 pid=244) INFO 11-06 09:52:12 [gpu_model_runner.py:2392] Model loading took 1.1201 GiB and 20.854677 seconds
(EngineCore_DP0 pid=244) INFO 11-06 09:52:17 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/715e679d97/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=244) INFO 11-06 09:52:17 [backends.py:550] Dynamo bytecode transform time: 4.60 s
(EngineCore_DP0 pid=244) [rank0]:W1106 09:52:18.696000 244 torch/_inductor/utils.py:1554] [0/0] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=244) INFO 11-06 09:52:23 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=244) INFO 11-06 09:52:37 [backends.py:215] Compiling a graph for dynamic shape takes 19.53 s
(EngineCore_DP0 pid=244) INFO 11-06 09:52:40 [monitor.py:34] torch.compile takes 24.13 s in total
(EngineCore_DP0 pid=244) INFO 11-06 09:53:39 [gpu_worker.py:298] Available KV cache memory: 101.12 GiB
(EngineCore_DP0 pid=244) INFO 11-06 09:53:39 [kv_cache_utils.py:864] GPU KV cache size: 946,752 tokens
(EngineCore_DP0 pid=244) INFO 11-06 09:53:39 [kv_cache_utils.py:868] Maximum concurrency for 32,000 tokens per request: 29.59x
(EngineCore_DP0 pid=244) 2025-11-06 09:53:42,231 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=244) 2025-11-06 09:53:42,284 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 22.59it/s]
(EngineCore_DP0 pid=244) INFO 11-06 09:53:43 [gpu_model_runner.py:3118] Graph capturing finished in 1 secs, took 0.19 GiB
(EngineCore_DP0 pid=244) INFO 11-06 09:53:43 [gpu_worker.py:391] Free memory on device (117.92/122.82 GiB) on startup. Desired GPU memory utilization is (0.85, 104.4 GiB). Actual usage is 1.12 GiB for weight, 0.3 GiB for peak activation, 1.86 GiB for non-torch memory, and 0.19 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=108218533273` to fit into requested memory, or `--kv-cache-memory=122736089088` to fully utilize gpu memory. Current kv cache memory in use is 108582073753 bytes.
(EngineCore_DP0 pid=244) INFO 11-06 09:53:43 [core.py:218] init engine (profile, create kv cache, warmup model) took 90.43 seconds
(APIServer pid=157) INFO 11-06 09:53:44 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 59172
(APIServer pid=157) INFO 11-06 09:53:44 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=157) INFO 11-06 09:53:45 [api_server.py:1715] Supported_tasks: ['generate']
(APIServer pid=157) WARNING 11-06 09:53:45 [__init__.py:1695] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=157) INFO 11-06 09:53:45 [serving_responses.py:130] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=157) INFO 11-06 09:53:46 [serving_chat.py:137] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=157) INFO 11-06 09:53:46 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=157) INFO 11-06 09:53:46 [api_server.py:1994] Starting vLLM API server 0 on http://0.0.0.0:6678
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:36] Available routes are:
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=157) INFO 11-06 09:53:46 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=157) INFO:     Started server process [157]
(APIServer pid=157) INFO:     Waiting for application startup.
(APIServer pid=157) INFO:     Application startup complete.

Thanks.

I will.
Which docker command did you use? sudo docker run -it --rm nvcr.io/nvidia/vllm:25.10-py3 ?

@AastaLLL when I look at your log I notice that it says Starting to load model Qwen/Qwen3-0.6B..., that there is only 1 safetensors checkpoint shard, and that Model loading took 1.1201 GiB and 20.854677 seconds, which is much smaller than I’d expect for a 70B model.
When I run the Hermes model with vLLM 25.09-py3 I see the log message Starting to load model NousResearch/Hermes-4-70B-FP8…, there are 15 checkpoint shards, and Model loading took 67.7231 GiB and 28.921133 seconds, which is what I’d expect for a 70B model.
Looking at the documentation, I see that the default model is indeed Qwen/Qwen3-0.6B, and that --served-model-name only configures the name the server reports via the OpenAI API.
Can you test again with --model NousResearch/Hermes-4-70B-FP8 instead of --served-model-name NousResearch/Hermes-4-70B-FP8 ?
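
Concretely, something like this is what I mean (your command from above, with the model selected via --model; I haven’t run this exact line myself):

python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-4-70B-FP8 \
  --host 0.0.0.0 \
  --port 6678 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32k \
  --cuda-graph-sizes 4 \
  --max_num_seqs 4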

Hi,

Thanks for pointing this out. We didn’t notice that when we tested it.
We are testing again with --model NousResearch/Hermes-4-70B-FP8 and will share more information with you soon.

Thanks.

Hi,

Thanks for your patience.

We confirmed that we see the same error with Hermes-4-70B-FP8 in the vLLM 25.10 container.
We are checking with the internal team to gather more information.

Thanks.
