not naive… incomplete ;)
I am a beginner. This may not be optimal, but it works.
---
## STEP 1: Create virtual environment
```shell
python3 -m venv ~/venv/vllm-cu130
source ~/venv/vllm-cu130/bin/activate
```
# Use venv to avoid system Python / CUDA conflicts
---
## STEP 2: Check GPU and architecture
```shell
nvidia-smi
uname -m
```
# Expected:
# - CUDA Version: 13.0
# - GPU: NVIDIA GB10
# - Architecture: aarch64
# GB10 compute capability is 12.1 (important later)
---
## STEP 3: Install vLLM wheel built for CUDA 13
```shell
pip install -U vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl
```

```shell
python - <<'PY'
import vllm
print(vllm.__version__)
PY
```
# vLLM installs successfully, but PyTorch is NOT correct yet
---
## STEP 4: Reinstall PyTorch with CUDA 13 support
```shell
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```
# vLLM-installed torch may not work with CUDA 13
# Must manually install cu130 PyTorch
---
## STEP 5: Verify PyTorch CUDA works
```shell
python - <<'PY'
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
print(torch.cuda.get_device_capability(0))
PY
```
# Expect cuda.is_available() == True
# Capability should be (12, 1)
---
## STEP 6: Prepare Qwen3-32B-Base model (safetensors)
```shell
export MODEL_DIR="/home/ziyao/Desktop/AI/Qwen332B"
test -f "$MODEL_DIR/config.json" && echo OK
```

```shell
python - <<'PY'
from transformers import AutoConfig
AutoConfig.from_pretrained("/home/ziyao/Desktop/AI/Qwen332B", trust_remote_code=True)
print("config load OK")
PY
```
# Model is Base version, not instruct
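The two checks above can be folded into one small helper (a hedged sketch; the path is the one used in this guide, so adjust it to your own layout):

```python
# Hedged sketch: sanity-check that a local HF model directory has the
# files vLLM needs before serving (config.json plus safetensors weights).
from pathlib import Path

def looks_like_hf_model(model_dir):
    d = Path(model_dir)
    has_config = (d / "config.json").is_file()
    has_weights = any(d.glob("*.safetensors"))
    return has_config and has_weights

# Path is illustrative; prints True only if both checks pass.
print(looks_like_hf_model("/home/ziyao/Desktop/AI/Qwen332B"))
```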
---
## STEP 7: Problem when starting vLLM normally
```text
Value 'sm_121a' is not defined for option 'gpu-name'
```
# Root cause:
# - GB10 compute capability = 12.1
# - torch.compile / Triton generates sm_121a
# - Current toolchain does not fully support it
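To make the connection between Step 5 and this error concrete: the `(major, minor)` capability tuple is turned into an `sm_<major><minor>` architecture string, and the trailing `a` marks the architecture-specific variant. A simplified illustration of the naming convention (not vLLM's or Triton's actual code):

```python
# Simplified illustration: how a (major, minor) compute capability becomes
# the "sm_<major><minor>" target string used by nvcc/Triton. The optional
# suffix ("a") denotes the architecture-specific variant.
def sm_name(major, minor, suffix=""):
    return f"sm_{major}{minor}{suffix}"

print(sm_name(12, 1, "a"))  # GB10: capability (12, 1) -> sm_121a
```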
---
## STEP 8: Workaround - enforce eager mode
```shell
CUDA_VISIBLE_DEVICES=0 vllm serve /home/ziyao/Desktop/AI/Qwen332B \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype float16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --served-model-name qwen3-local
```
# --enforce-eager disables torch.compile / inductor / Triton
# Lower performance, but stable and works
---
## STEP 9: Verify API is running
```shell
curl http://127.0.0.1:8000/v1/models
```
# If "qwen3-local" appears, server is working
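Beyond listing models, a quick end-to-end check is to request an actual completion (this assumes the Step 8 server is running; since Qwen3-32B-Base is a base model, the completions endpoint is a better fit than chat):

```shell
# Smoke test: the model name must match --served-model-name from Step 8
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-local", "prompt": "The capital of France is", "max_tokens": 16}'
```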
---
## Conclusion
This setup works on GB10 with CUDA 13.
Performance is not optimal, but stability is good.
If there is a better solution without --enforce-eager,
please let me know.
pytorch link: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
Sorry for missing that; the content was translated by AI.
Go here: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
Forum member @eugr has done a ton of work to make this straightforward.
Thanks for sharing that Docker command.
For some reason, when I try to use this for Qwen3-Coder, GLM-4.5-Air, IQuest-Coder (remote code seems to be finicky), Devstral-Small-2, etc., these models don’t seem to work.
Do you know if there is a list of compatible models besides vLLM for Inference | DGX Spark ? (This seems to be outdated)
For example, I’m trying to run:

```shell
sudo docker run --runtime nvidia --gpus all -p 8000:8000 --ipc=host --platform "linux/arm64" vllm/vllm-openai:nightly --model Firworks/Qwen3-Coder-30B-A3B-Instruct-nvfp4 --dtype auto --max-model-len 32768
```

(Qwen/Qwen3-Coder-30B-A3B-Instruct also doesn’t work, of course)
I also tried the following (as well as their originals) but no luck:
Firworks/IQuest-Coder-V1-40B-Instruct-nvfp4
Firworks/Devstral-Small-2-24B-Instruct-2512-nvfp4
naver-hyperclovax/HyperCLOVAX-SEED-Think-32B
Firworks/Solar-Open-100B-nvfp4
Firworks/GLM-4.5-Air-Derestricted-nvfp4
etc.
:(
Just build a Docker image from our community repository: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks
Or use pre-built vLLM wheels on a host system. vLLM official docker doesn’t work on Spark.
Thanks!
It says there that you added support for one model, but I can’t seem to find your list of compatible models like the matrix I linked by Nvidia. Is there one?
All models that are compatible with current vLLM should be able to run in this Docker too. As long as they fit in memory. I’ll post a list of models that I personally tested so far a bit later.
Not a complete list, but ones that I have in my notes:
- Qwen/Qwen3-VL-4B-Instruct-FP8
- Qwen/Qwen3-VL-30B-A3B-Instruct-FP8
- Qwen/Qwen3-VL-32B-Instruct-FP8
- Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
- Qwen/Qwen3-Next-80B-A3B-Thinking-FP8
- openai/gpt-oss-120b
- RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4
- QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ
- cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit
- QuantTrio/MiniMax-M2-AWQ
- QuantTrio/GLM-4.6-AWQ
- zai-org/GLM-4.6V-FP8
- cyankiwi/GLM-4.6V-AWQ-4bit
- QuantTrio/GLM-4.7-AWQ
- Salyut1/GLM-4.7-NVFP4
- lukealonso/MiniMax-M2.1-NVFP4
- RedHatAI/Qwen3-30B-A3B-NVFP4
- cpatonn/Qwen3-VL-30B-A3B-Instruct-AWQ-8bit
- cyankiwi/MiniMax-M2.1-AWQ-4bit
- QuantTrio/MiniMax-M2.1-AWQ
I tried Firworks/Solar-Open-100B-nvfp4 recently and it failed to load. I futzed with it a bit and gave up because the error messages were too obtuse. Coincidentally, today I tried to clean out some cruft in my hf cache folder and noticed that Firworks/Solar-Open-100B-nvfp4 was only partly downloaded. I’m downloading it again and if I get a complete download I will retry it and let you know if that was the problem.
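For spotting partial downloads like this, note that huggingface_hub writes in-progress blobs with an `.incomplete` suffix inside the cache, so a quick scan can flag them. A hedged sketch (the default cache location `~/.cache/huggingface/hub` is assumed; adjust if you set `HF_HOME`):

```python
# Hedged sketch: list partially-downloaded blobs in an HF cache directory.
# huggingface_hub names in-flight downloads "<hash>.incomplete".
from pathlib import Path

def incomplete_blobs(cache_dir):
    cache = Path(cache_dir)
    if not cache.is_dir():
        return []
    return sorted(str(p) for p in cache.rglob("*.incomplete"))

# Default cache location is an assumption; prints nothing if all downloads
# completed cleanly.
for p in incomplete_blobs(Path.home() / ".cache/huggingface/hub"):
    print(p)
```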
OK, it didn’t work, but now I remember the problem I ran into. [I’m using an earlier version of @eugr’s repository, so that may not help.] The problem is that the parameters I pass to vllm include some that are not known to the vllm system:
```shell
vllm serve Firworks/Solar-Open-100B-nvfp4 --dtype auto --max-model-len 32768 --trust-remote-code --enable-auto-tool-choice --tool-call-parser solar_open --reasoning-parser solar_open --logits-processors vllm.model_executor.models.parallel_tool_call_logits_processor:ParallelToolCallLogitsProcessor --logits-processors vllm.model_executor.models.solar_open_logits_processor:SolarOpenTemplateLogitsProcessor --port 8000 --host 0.0.0.0 --gpu-memory-utilization 0.7 -tp 2 --distributed-executor-backend ray --max-model-len 128000
```
The first one it doesn’t like is the tool-call-parser:
```text
KeyError: 'invalid tool call parser: solar_open (chose from { deepseek_v3,deepseek_v31,deepseek_v32,ernie45,gigachat3,glm45,glm47,granite,granite-20b-fc,hermes,hunyuan_a13b,internlm,jamba,kimi_k2,llama3_json,llama4_json,llama4_pythonic,longcat,minimax,minimax_m2,mistral,olmo3,openai,phi4_mini_json,pythonic,qwen3_coder,qwen3_xml,seed_oss,step3,xlam })'
```
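That KeyError is a plain name-registry lookup failing: `solar_open` was simply never registered with this vLLM build. Conceptually, the lookup behaves like this dependency-free sketch (illustrative only, not vLLM's actual code):

```python
# Illustrative: tool-call parsers are looked up by name in a registry,
# and an unregistered name raises the KeyError seen in the log above.
# The registry contents here are a small made-up subset.
REGISTERED_PARSERS = {"hermes": None, "mistral": None, "qwen3_coder": None}

def get_tool_parser(name):
    if name not in REGISTERED_PARSERS:
        raise KeyError(
            f"invalid tool call parser: {name} "
            f"(chose from {{ {','.join(sorted(REGISTERED_PARSERS))} }})")
    return REGISTERED_PARSERS[name]
```

So the fix is not a flag on the command line; the parser class itself has to be registered in (or shipped with) the vLLM build you are running.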
Where did I get the parameters? They are on the HuggingFace page for Solar Open 100B. But the code on that page recommends using their own Docker container upstage/Solar-Open-100B. I don’t think this works on Spark for various reasons.
So at this point my problem is how to add the tool-call-parser, reasoning-parser, and logits-processors that Solar is expecting. Perhaps @eugr, my magic man, knows the answer, which may be as simple as installing his latest version of the custom vllm Docker container for Spark!
As a general rule, @eugr’s custom vllm code works very well on the Spark, in my experience.
[Note from the future]: I asked gpt-oss-120b about the problem and it says the vllm parsers are stored in vllm/vllm/tools/parsers/
What’s more, when I looked at the list of files provided in the hf download I see a bunch of python files that are exactly what I need. Since they weren’t found, I assume I need to download them and make them visible to the running vllm instance, but first I need to make sure they’re not full of calls to China or wherever. Hmm, there’s probably an appropriate way to do this; maybe wait for the vllm maintainers to add them?
It’s nice to have my Mac (and other more knowledgeable enthusiasts) take care of me there, but on the Spark I have to force myself to learn more things. Good and bad.
Appreciate it!!! I will try!
I’ve looked at the model; unfortunately, instead of packaging their custom code as proper plugins, they modified the vllm code directly. I tried to apply the patch to the current vllm build, but it’s not compatible anymore, as the changes were made to the 0.12.x codebase. Without those changes their parser will not work with current vllm, even if you download it from HF.
The good news is that some of those changes, related to their reasoning parser, are waiting for the merge as a proper PR, but not all of those.
I can have another look and see if any of the failing changes can be modified to apply to the current codebase, but it looks like the devs are putting effort in baking the support into mainline themselves, so if we wait a bit, it will be supported too.
Once again, Gene, thanks so much for your help. I want to start experimenting with tool use in LLMs and this obviously is leading into that. I am also learning (or relearning) that Nvidia’s newest toys take time to fit into the ecosystem of software that has not been tweaked for them.
I’m glad to hear that people are working on the necessary changes so I will wait for that.
Are you also concerned about what’s in these python scripts from the model developers from a security standpoint?
Yes, when the model comes with some extra executable code, I normally have a quick look to see if there is anything suspicious there, unless it’s a trusted publisher.
Thanks a lot!
The link did not work, but I found it:
docker pull isnob46/dgx-vllm-cu13:014rc2
Any word on 25.12.post1-py3 containing vLLM 0.13.x?
I just ran the updated instructions in the top-level post as of 1/16, and it seems something has changed, likely a small bump in some PyTorch dependencies which the specific vLLM wheel now tries to update.
Specifically, uv installs torch etc. properly from the cu130 URL, but then, when the specific vLLM wheel is installed, it overwrites torch with a non-CUDA variant. In the uv changelog for this step, I see:
```diff
- torch==2.9.1+cu130
+ torch==2.9.0
- torchaudio==2.9.1
+ torchaudio==2.9.0
- torchvision==0.24.1
+ torchvision==0.24.0
```
which… it’s odd that it downgraded like this. Maybe that wheel is hard pinned? Regardless, this is the root of the error which now occurs if the exact script is followed:
```text
ImportError: libtorch_cuda.so: cannot open shared object file: No such file or directory
```
I ultimately resolved this by running uv pip uninstall against vllm, torch, torchvision, and torchaudio, then installing vllm again with the modified command below, adding PyTorch’s CUDA 13 index via the --extra-index-url flag. I don’t love shoehorning in Torch’s repo, but I have done this to solve similar issues before, and it may be more reliable moving forward. @johnny_nv could consider tweaking the Step 4 command to be as follows (it might also replace Step 3 as a single shot, though untested):
```shell
uv pip install https://github.com/vllm-project/vllm/releases/download/v0.13.0/vllm-0.13.0+cu130-cp38-abi3-manylinux_2_35_aarch64.whl --extra-index-url https://download.pytorch.org/whl/cu130
```
This brought torch+cu130 back in and I am now able to run vLLM without errors.
Hi, sorry, I accidentally deleted the previous message. Anyway, for convenience, I built and pushed the image generated by @eugr’s magnificent work to DockerHub. The tag is:
isnob46/dgx-vllm-cu13:014
updated to the vllm 0.14.0 release that came out yesterday.
I also have a question: has anyone managed to quantize a model correctly using NVIDIA’s Model Optimizer and then run it on vLLM?
Helpful information, thanks for sharing. It can help us avoid mistakes while setting up vLLM.