Hardware Platform (GPU model and numbers)
- GPU Model: NVIDIA Blackwell Pro 6000
- GPU Count: 1
System Memory
- RAM: 256 GB
Ubuntu Version
- Version: 24.04
NVIDIA GPU Driver Version (valid for GPU only)
- 580.65.06
Issue Type
- Type: Bug
Issue Description
When attempting to launch a vLLM (version 0.10.2) container using a Hugging Face model, the container fails to initialize the engine core.
This issue occurs specifically with the model nvidia/Cosmos-Reason1-7B.
Interestingly, the same setup works successfully on an ADA6000 (Ada Lovelace) GPU but fails on the Blackwell Pro 6000, which suggests a compatibility issue between this model's attention kernels and the newer Blackwell architecture.
The same behavior and error have also been reported in GitHub Issue #65.
Below is the exact error message observed in the container logs:
(EngineCore_0 pid=6357)
CUDA error (/__w/xformers/xformers/third_party/flash-attention/hopper/flash_fwd_launch_template.h:188): invalid argument
Traceback (most recent call last):
...
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
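The failing frame is in a prebuilt flash-attention kernel, so the Ada-vs-Blackwell gap likely comes down to CUDA compute capability: Ada Lovelace GPUs report 8.9 (sm_89), while workstation Blackwell GPUs report 12.0 (sm_120). The sketch below illustrates that mismatch; the capability values are NVIDIA's published numbers for each architecture, and the "prebuilt" set is a hypothetical build matrix, not inspected from this wheel.

```python
# Sketch: why a kernel built for older architectures can reject a newer GPU.
# Capability values are NVIDIA's published numbers (an assumption, not
# probed from this machine); the prebuilt set is hypothetical.
GPU_COMPUTE_CAPABILITY = {
    "RTX 6000 Ada": (8, 9),         # Ada Lovelace, sm_89
    "Blackwell Pro 6000": (12, 0),  # workstation Blackwell, sm_120
}

def kernel_supported(capability, supported_archs):
    """Return True if a prebuilt kernel covers this compute capability."""
    return capability in supported_archs

# Hypothetical build that shipped kernels only up to sm_90 (Hopper):
prebuilt = {(8, 0), (8, 6), (8, 9), (9, 0)}

print(kernel_supported(GPU_COMPUTE_CAPABILITY["RTX 6000 Ada"], prebuilt))
print(kernel_supported(GPU_COMPUTE_CAPABILITY["Blackwell Pro 6000"], prebuilt))
```

Launching such a kernel on an uncovered architecture surfaces at runtime as a CUDA "invalid argument" error rather than a clean unsupported-architecture message, which matches the log above.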
How To Reproduce
1) Run the following command:
# Deploy with docker on Linux:
docker run --runtime nvidia --gpus all \
--name my_vllm_container \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.10.2 \
--model nvidia/Cosmos-Reason1-7B
2) Observe that the container exits with the CUDA invalid argument error shown above.
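Since the failing frame points at a flash-attention kernel bundled with xformers, one possible mitigation (untested on this hardware, offered only as an assumption) is to steer vLLM onto a different attention backend via the `VLLM_ATTENTION_BACKEND` environment variable:

```shell
# Possible workaround (unverified on the Blackwell Pro 6000): force a
# non-flash-attention backend so the failing kernel is never launched.
# VLLM_ATTENTION_BACKEND is a documented vLLM environment variable;
# FLASHINFER is one of its accepted values.
docker run --runtime nvidia --gpus all \
  --name my_vllm_container \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
  --env "VLLM_ATTENTION_BACKEND=FLASHINFER" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model nvidia/Cosmos-Reason1-7B
```

If the container starts with this variable set, that would further confirm the root cause is confined to the flash-attention path rather than the model itself.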
Expected Behavior
The vLLM container should successfully load the Cosmos-Reason1-7B model and start serving inference normally.