I am following the directions on build.nvidia for running vLLM across two DGX Sparks. I can get the model to load, and `ray status` says the GPUs are reserved, but after loading, `vllm serve` just hangs and I am unable to curl any of the APIs at localhost:8000. It returns `curl: (7) Failed to connect to localhost port 8000 after 0 ms: Couldn't connect to server`. I have tried from inside the host container and from outside. I have added `--host` to the `vllm serve` command, and I also added `-p 8000:8000` to the `run_script.sh` used in the tutorial. Need help.
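In case it helps, here is the quick check I have been running from the host to see whether anything is bound to the port at all (a minimal sketch; `localhost:8000` is just the address from my setup):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused / timed out: nothing is listening yet.
        return False

# While vllm serve is "hanging", port_open("localhost", 8000) stays False,
# which matches the curl (7) error above.
```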
Could you paste the script you used and the logfile(s) here? That might help us help you.
Here is the log from the host. After it loads the model it does not produce any more logs.
INFO 10-19 01:02:45 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=981) INFO 10-19 01:02:46 [api_server.py:1805] vLLM API server version 0.10.1.1+381074ae.nv25.09
(APIServer pid=981) INFO 10-19 01:02:46 [utils.py:326] non-default args: {'model_tag': 'empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', 'host': '0.0.0.0', 'model': 'empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', 'max_model_len': 8000, 'tensor_parallel_size': 2}
(APIServer pid=981) INFO 10-19 01:02:51 [__init__.py:711] Resolved architecture: LlamaForCausalLM
(APIServer pid=981) INFO 10-19 01:02:51 [__init__.py:1750] Using max model len 8000
(APIServer pid=981) INFO 10-19 01:02:51 [gptq_marlin.py:170] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
(APIServer pid=981) INFO 10-19 01:02:52 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO 10-19 01:02:55 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1089) INFO 10-19 01:02:58 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=1089) INFO 10-19 01:02:58 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1+381074ae.nv25.09) with config: model='empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', speculative_config=None, tokenizer='empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=1089) 2025-10-19 01:02:58,302 INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.0.0.73:6379...
(EngineCore_0 pid=1089) 2025-10-19 01:02:58,309 INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265
(EngineCore_0 pid=1089) INFO 10-19 01:02:59 [ray_utils.py:339] No current placement group found. Creating a new placement group.
(EngineCore_0 pid=1089) WARNING 10-19 01:02:59 [ray_utils.py:200] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 4104c5164729d07b8b23203cf9b1864dbc5c337c02e16f2fd7cafc3e. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_0 pid=1089) WARNING 10-19 01:02:59 [ray_utils.py:200] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node c91702895b966640dd56518e37f3600dad9b471a3609335fd89f1d4d. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
(EngineCore_0 pid=1089) INFO 10-19 01:02:59 [ray_distributed_executor.py:169] use_ray_spmd_worker: True
(EngineCore_0 pid=1089) (pid=1193) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
(EngineCore_0 pid=1089) (pid=1193) import pynvml # type: ignore[import]
(EngineCore_0 pid=1089) (pid=1193) INFO 10-19 01:03:01 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1089) INFO 10-19 01:03:03 [ray_env.py:63] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_0 pid=1089) INFO 10-19 01:03:03 [ray_env.py:65] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_V1', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_RAY_COMPILED_DAG', 'MAX_JOBS', 'VLLM_USE_RAY_SPMD_WORKER', 'CUDA_HOME']
(EngineCore_0 pid=1089) INFO 10-19 01:03:03 [ray_env.py:68] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) [W1019 01:03:04.359149134 ProcessGroupNCCL.cpp:927] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [__init__.py:1418] Found nccl from library libnccl.so.2
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [pynccl.py:70] vLLM is using nccl==2.27.7
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) WARNING 10-19 01:03:04 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[], buffer_handle=None, local_subscribe_addr=None, remote_subscribe_addr='tcp://10.0.0.73:60481', remote_addr_ipv6=False)
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [gpu_model_runner.py:1953] Starting to load model empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:05 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:05 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:06 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_0 pid=1089) (pid=266, ip=10.0.0.92) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
(EngineCore_0 pid=1089) (pid=266, ip=10.0.0.92) import pynvml # type: ignore[import]
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:06 [weight_utils.py:349] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:39<00:00, 39.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:39<00:00, 39.58s/it]
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193)
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [W1019 01:03:04.422445362 ProcessGroupNCCL.cpp:927] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:47 [default_loader.py:262] Loading weights took 40.55 seconds
(EngineCore_0 pid=1089) (pid=266, ip=10.0.0.92) INFO 10-19 01:03:02 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see
for more options.)
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [__init__.py:1418] Found nccl from library libnccl.so.2
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:04 [pynccl.py:70] vLLM is using nccl==2.27.7
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) WARNING 10-19 01:03:04 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:04 [gpu_model_runner.py:1953] Starting to load model empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:05 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:05 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:05 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=266, ip=10.0.0.92) INFO 10-19 01:03:06 [weight_utils.py:296] Using model weights format ['*.safetensors']
(EngineCore_0 pid=1089) (RayWorkerWrapper pid=1193) INFO 10-19 01:03:48 [gpu_model_runner.py:2007] Model loading took 18.6364 GiB and 42.791994 seconds
Are you using containers? 0.10.1.1+381074ae.nv25.09 looks like the NVIDIA-packaged version, which should be OK.
Is this the script you are using?
I have never tried connecting multiple nodes myself; so far I have only used multiple GPUs in a single instance. No fast Ethernet connections to play with.
But I would expect some kind of error. Getting stuck after the weights have loaded, without any errors, is odd.
So 10.0.0.73 is your “head” node? And 10.0.0.92 the worker?
Anything to see on the second node? Anything on the Ray cluster dashboard (http://127.0.0.1:8265) mentioned in the log? I would expect some kind of handshake/sync issue.
This is the log from the worker. Looks similar.
:job_id:01000000
INFO 10-19 16:33:58 [__init__.py:241] Automatically detected platform cuda.
:actor_name:RayWorkerWrapper
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 10-19 16:34:01 [__init__.py:1418] Found nccl from library libnccl.so.2
INFO 10-19 16:34:01 [pynccl.py:70] vLLM is using nccl==2.27.7
WARNING 10-19 16:34:02 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 10-19 16:34:02 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 10-19 16:34:02 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
INFO 10-19 16:34:02 [gpu_model_runner.py:1953] Starting to load model empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit...
INFO 10-19 16:34:02 [gpu_model_runner.py:1985] Loading model from scratch...
INFO 10-19 16:34:02 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO 10-19 16:34:02 [cuda.py:328] Using Flash Attention backend on V1 engine.
INFO 10-19 16:34:03 [weight_utils.py:296] Using model weights format ['*.safetensors']
Oh wow it looks like it was processing something! It took about 15 minutes but it started posting new logs and now the server is up and running. I guess I just needed to wait…
(EngineCore_0 pid=1164) (RayWorkerWrapper pid=315, ip=10.0.0.92) INFO 10-19 16:34:03 [weight_utils.py:296] Using model weights format ['*.safetensors']
(EngineCore_0 pid=1164) (RayWorkerWrapper pid=1273) INFO 10-19 16:35:11 [gpu_model_runner.py:2007] Model loading took 18.6364 GiB and 68.875958 seconds
(EngineCore_0 pid=1164) (RayWorkerWrapper pid=315, ip=10.0.0.92) INFO 10-19 16:49:11 [weight_utils.py:312] Time spent downloading weights for empirischtech/DeepSeek-R1-Distill-Llama-70B-gptq-4bit: 907.372217 seconds
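Given the 907-second weight download in that last line, the "hang" was apparently just the worker pulling weights. A small polling helper would have saved me the guessing (a rough sketch with hypothetical names; the probe just tries a TCP connect to the API port from my setup):

```python
import socket
import time

def api_up(host: str = "localhost", port: int = 8000) -> bool:
    """True once something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True
    except OSError:
        return False

def wait_for_server(probe, timeout: float = 1800.0, interval: float = 10.0) -> bool:
    """Poll probe() every `interval` seconds until it returns True or `timeout` passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

# wait_for_server(api_up)  # keeps polling until vllm serve finally binds the port
```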
😅 Nice. As there were no errors, I had hoped to see more in the dashboard; I suspected it was still trying to do something. The most recent job looks like it has (or had) a spinning wheel in front of it.
Does it now show a deployment after everything loaded?
Glad to hear that it only took some more time. If I ever get the opportunity to network two boxes together, I will start with a smaller model first to see whether it works.
Yes, it logs that the server is running, and I was able to connect to it via Roo Code. So it appears all is good.
Look into this thread. There is a ton of information, insights, and code about using vLLM and Ray on stacked Sparks, including some improvements to loading times:
