I am following the directions on build.nvidia for running vLLM across two DGX Sparks. I can get the model to load, and `ray status` says the GPUs are reserved, but after loading, `vllm serve` just hangs and I am unable to curl any of the APIs at localhost:8000. It returns `curl: (7) Failed to connect to localhost port 8000 after 0 ms: Couldn't connect to server`. I have tried from inside the host container and from outside. I have added `--host` to the `vllm serve` command, and I also added `-p 8000:8000` to the `run_script.sh` used in the tutorial. Need help.
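In case it helps, here is the quick check I have been running from the host to see whether anything is bound to the port at all (a minimal sketch; `localhost:8000` is just the address from my setup):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused / timed out: nothing is listening yet.
        return False

# While vllm serve is "hanging", port_open("localhost", 8000) stays False,
# which matches the curl (7) error above.
```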
Could you paste the script you used and the logfile(s) here? That might help us help you.
Here is the log from the host. After it loads the model it does not produce any more logs.
INFO 10-19 01:02:45 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=981) INFO 10-19 01:02:46 [api_server.py:1805] vLLM API server version 0.10.1.1+381074ae.nv25.09
(APIServer pid=981) INFO 10-19 01:02:46 [utils.py:326] non-default args: {'model_tag': 'empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', 'host': '0.0.0.0', 'model': 'empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', 'max_model_len': 8000, 'tensor_parallel_size': 2}
(APIServer pid=981) INFO 10-19 01:02:51 [__init__.py:711] Resolved architecture: LlamaForCausalLM
(APIServer pid=981) INFO 10-19 01:02:51 [__init__.py:1750] Using max model len 8000
(APIServer pid=981) INFO 10-19 01:02:51 [gptq_marlin.py:170] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
(APIServer pid=981) INFO 10-19 01:02:52 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO 10-19 01:02:55 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1089) INFO 10-19 01:02:58 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=1089) INFO 10-19 01:02:58 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1+381074ae.nv25.09) with config: model='empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', speculative_config=None, tokenizer='empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=1089) 2025-10-19 01:02:58,302 INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.0.0.73:6379...
(EngineCore_0 pid=1089) 2025-10-19 01:02:58,309 INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(EngineCore_0 pid=1089) INFO 10-19 01:02:59 [ray_utils.py:339] No current placement group found. Creating a new placement group.
(EngineCore_0 pid=1089) WARNING 10-19 01:02:59 [ray_utils.py:200] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 4104c5164729d07b8b23203cf9b1864dbc5c337c02e16f2fd7cafc3e. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_0 pid=1089) WARNING 10-19 01:02:59 [ray_utils.py:200] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node c91702895b966640dd56518e37f3600dad9b471a3609335fd89f1d4d. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_0 pid=1089) INFO 10-19 01:02:59 [ray_distributed_executor.py:169] use_ray_spmd_worker: True
(EngineCore_0 pid=1089) (pid=1193) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
(EngineCore_0 pid=1089) (pid=1193) import pynvml # type: ignore[import]
(EngineCore_0 pid=1089) (pid=1193) INFO 10-19 01:03:01 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1089) INFO 10-19 01:03:03 [ray_env.py:63] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_0 pid=1089) INFO 10-19 01:03:03 [ray_env.py:65] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_V1', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_RAY_COMPILED_DAG', 'MAX_JOBS', 'VLLM_USE_RAY_SPMD_WORKER', 'CUDA_HOME']
(EngineCore_0 pid=1089) INFO 10-19 01:03:03 [ray_env.py:68] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) [W1019 01:03:04.359149134 ProcessGroupNCCL.cpp:927] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [__init__.py:1418] Found nccl from library libnccl.so.2
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [pynccl.py:70] vLLM is using nccl==2.27.7
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) WARNING 10-19 01:03:04 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[], buffer_handle=None, local_subscribe_addr=None, remote_subscribe_addr='tcp://10.0.0.73:60481', remote_addr_ipv6=False)
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [gpu_model_runner.py:1953] Starting to load model empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:05 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:05 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:06 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_0 pid=1089) (pid=266, ip=10.0.0.92) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
(EngineCore_0 pid=1089) (pid=266, ip=10.0.0.92) import pynvml # type: ignore[import]
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:06 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:39<00:00, 39.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:39<00:00, 39.58s/it]
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193)
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [W1019 01:03:04.422445362 ProcessGroupNCCL.cpp:927] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:47 [default_loader.py:262] Loading weights took 40.55 seconds
(EngineCore_0 pid=1089) (pid=266, ip=10.0.0.92) INFO 10-19 01:03:02 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see
for more options.)
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [__init__.py:1418] Found nccl from library libnccl.so.2
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [pynccl.py:70] vLLM is using nccl==2.27.7
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) WARNING 10-19 01:03:04 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [gpu_model_runner.py:1953] Starting to load model empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:05 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:05 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:06 [weight_utils.py:296] Using model weights format ['*.safetensors']
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:48 [gpu_model_runner.py:2007] Model loading took 18.6364 GiB and 42.791994 seconds
Are you using containers? 0.10.1.1+381074ae.nv25.09 looks like the NVIDIA-packaged version, which should be OK.
Is this the script you are using?
I have never tried connecting multiple nodes myself; so far I have only used multiple GPUs in a single instance. No fast Ethernet connections to play with.
But I would expect some kind of error. Getting stuck after the weights have loaded, without any errors, is odd.
So 10.0.0.73 is your “head” node? And 10.0.0.92 the worker?
Anything to see on the second node? Anything on the Ray cluster dashboard (http://127.0.0.1:8265) mentioned in the log? I would expect some kind of handshake/sync issue.
This is the log from the worker. Looks similar.
:job_id:01000000
INFO 10-19 16:33:58 [__init__.py:241] Automatically detected platform cuda.
:actor_name:RayWorkerWrapper
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 10-19 16:34:01 [__init__.py:1418] Found nccl from library libnccl.so.2
INFO 10-19 16:34:01 [pynccl.py:70] vLLM is using nccl==2.27.7
WARNING 10-19 16:34:02 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 10-19 16:34:02 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 10-19 16:34:02 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
INFO 10-19 16:34:02 [gpu_model_runner.py:1953] Starting to load model empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit...
INFO 10-19 16:34:02 [gpu_model_runner.py:1985] Loading model from scratch...
INFO 10-19 16:34:02 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 10-19 16:34:02 [cuda.py:328] Using Flash Attention backend on V1 engine.
INFO 10-19 16:34:03 [weight_utils.py:296] Using model weights format ['*.safetensors']
Oh wow it looks like it was processing something! It took about 15 minutes but it started posting new logs and now the server is up and running. I guess I just needed to wait…
(EngineCore_0 pid=1164) (RayWorkerWrapper pid=315, ip=10.0.0.92) INFO 10-19 16:34:03 [weight_utils.py:296] Using model weights format ['*.safetensors']
(EngineCore_0 pid=1164) (RayWorkerWrapper pid=1273) INFO 10-19 16:35:11 [gpu_model_runner.py:2007] Model loading took 18.6364 GiB and 68.875958 seconds
(EngineCore_0 pid=1164) (RayWorkerWrapper pid=315, ip=10.0.0.92) INFO 10-19 16:49:11 [weight_utils.py:312] Time spent downloading weights for empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit: 907.372217 seconds
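Given the 907-second weight download in that last line, the "hang" was apparently just the worker pulling weights. A small polling helper would have saved me the guessing (a rough sketch with hypothetical names; the probe just tries a TCP connect to the API port from my setup):

```python
import socket
import time

def api_up(host: str = "localhost", port: int = 8000) -> bool:
    """True once something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True
    except OSError:
        return False

def wait_for_server(probe, timeout: float = 1800.0, interval: float = 10.0) -> bool:
    """Poll probe() every `interval` seconds until it returns True or `timeout` passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# wait_for_server(api_up)  # keeps polling until vllm serve finally binds the port
```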
😅 Nice. As there were no errors, I had hoped to see more in the dashboard; I suspected it was still trying to do something. The most recent job looks like it has (or had) a spinning wheel in front of it.
Does it now show a deployment after everything loaded?
Glad to hear that it only took some more time. If I ever get the opportunity to network two boxes together, I will start with a smaller model first to see whether it works.
Yes, it logs that the server is running, and I was able to connect to it via Roo Code. So it appears all is good.
Look into this thread. There is a ton of information, insights, and code about using vLLM and Ray on stacked Sparks, including some improvements to loading times:
