I'm running the Hermes model in the vLLM container. This worked with image version 25.09-py3 but now fails with 25.10-py3.
The Docker Compose service definition is:
vllm:
  deploy:
    resources:
      limits:
        memory: 100G
  image: nvcr.io/nvidia/vllm:25.10-py3
  ports:
    - 8000:8000
  privileged: true
  gpus: all
  shm_size: 8gb
  ipc: host
  ulimits:
    memlock: 1
    stack: 67108864
  env_file:
    - .env
  volumes:
    - $HOME/.cache:/root/.cache
    - $PWD:/workspace
  entrypoint:
    [
      "vllm", "serve",
      "--gpu-memory-utilization=0.7",
      "--max_model_len=5000",
      "--host=0.0.0.0", "--port=8000",
      "NousResearch/Hermes-4-70B-FP8"
    ]
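For reproducing outside Compose, the equivalent docker run invocation should be roughly the following (a sketch assuming the Compose options map one-to-one; --entrypoint overrides the image entrypoint the same way the entrypoint key does):

# Same image, limits, mounts, and vllm serve arguments as the Compose service above
docker run --rm --gpus all --ipc=host --privileged \
  --memory=100g --shm-size=8g \
  --ulimit memlock=1 --ulimit stack=67108864 \
  -p 8000:8000 --env-file .env \
  -v "$HOME/.cache:/root/.cache" -v "$PWD:/workspace" \
  --entrypoint vllm \
  nvcr.io/nvidia/vllm:25.10-py3 \
  serve --gpu-memory-utilization=0.7 --max_model_len=5000 \
  --host=0.0.0.0 --port=8000 NousResearch/Hermes-4-70B-FP8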
Some messages in the startup log:
Repeatedly: Trying to use TMA Descriptor Prefetch without CUTE_ARCH_TMA_SM90_ENABLED
and then startup aborts with: torch.AcceleratorError: CUDA error: unspecified launch failure
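The SM90 in that message refers to Hopper (compute capability 9.0), so it looks like a kernel built for a different architecture is being selected. A quick way to confirm what the container actually sees (a minimal check, assuming python3 and torch are importable inside the image):

# Print the torch build, its CUDA version, the GPU name, and the compute
# capability; SM90 / Hopper corresponds to capability (9, 0)
docker run --rm --gpus all --entrypoint python3 nvcr.io/nvidia/vllm:25.10-py3 \
  -c "import torch; print(torch.__version__, torch.version.cuda); \
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"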
See the attached file for the full log of the container start:
vllm-exception.txt (50.3 KB)