Hello,
I’m using vLLM on a dual-node DGX Spark setup. vLLM worked very well up to version 0.14.0, but starting from version 0.15.0, the model either crashes on the very first inference after loading, or it hangs without producing any output on the first inference.
When building the Docker image for vLLM 0.15.0, I’m using PyTorch 2.10.0. Is it possible that vLLM does not support PyTorch 2.10.0 yet?
Balaxxe
February 10, 2026, 4:47am
2
I made a bug report here:
opened 12:35AM - 10 Feb 26 UTC
**Description:**
Distributed TP inference with gpt-oss-120b hangs after 1-2 req… uests on the new 2026-02-09 build (PyTorch 2.10 + Triton 3.6.0). The hang is specific to FULL CUDA graph capture mode.
**Steps to reproduce:**
```bash
./launch-cluster.sh -t vllm-node \
-n <head_ip>,<worker_ip> \
--ib-if <head_ib>,<worker_ib> \
exec vllm serve openai/gpt-oss-120b \
--port 8000 --host 0.0.0.0 \
--gpu-memory-utilization 0.70 \
-tp 2 --distributed-executor-backend ray \
--max-model-len 131072 \
--no-enable-prefix-caching \
--max-num-batched-tokens 8192
```
Send 2-3 chat completion requests. The first 1-2 succeed, then the model freezes mid-generation on a subsequent request. Throughput drops to 0 with 1 req still showing as running. After 300s it crashes with `RayChannelTimeoutError`.
**What I tested:**
- `--enforce-eager` → stable
- `--compilation-config '{"cudagraph_mode":"none"}'` → stable
- `--compilation-config '{"cudagraph_mode":"piecewise"}'` → stable
- `--compilation-config '{"cudagraph_mode":"full"}'` → hangs after 1-2 requests
- Default (`FULL_AND_PIECEWISE`) → hangs after 1-2 requests
The FULL portion of CUDA graph capture is what breaks it. Piecewise alone is fine.
**Environment:**
- 2x DGX Spark (GB10, sm_12x, aarch64)
- TP=2 across nodes via InfiniBand
- DGX OS 7.4.0, kernel 6.17.0-1008-nvidia
- vllm-node image built 2026-02-09 (PyTorch 2.10, Triton 3.6.0)
- vLLM 0.15.2rc1.dev135+g285bab475
**Previous build behavior:**
This did not happen on the previous build. The only changes in the 2026-02-09 build were PyTorch 2.10, Triton 3.6.0, and removal of the fastsafetensors patch.
**Workaround:**
`--compilation-config '{"cudagraph_mode":"piecewise"}'`
---
It’s something with vLLM or pytorch.
Have not had a chance to fully debug but I do provide a temp workaround.
Will get to it soon.
eugr
February 10, 2026, 7:06am
3
I replied in the ticket, but have you used a wheels build by the chance?
I noticed all kinds of weird issues after the latest pytorch 2.10 migration. Should work fine if compiled from source (without --use-wheels flag).
I’m going to disable wheels builds for now.