I’ve noticed that vLLM consistently pegs at least two cores at all times on both nodes of my dual-Spark setup, despite vLLM supposedly “fixing” this problem with the VLLM_SLEEP_WHEN_IDLE setting (from what I can tell, this environment variable isn’t used anymore). Is this a problem with Ray? Has anyone solved this?
Use --no-ray when launching.
Will I still be able to run the two-node cluster without ray?
Yeah. Go check the changelog for eugr’s repo and ensure your nodes are updated.
Okay, you’re referring to eugr’s Docker image. I’m curious how this actually affects the launch arguments to vLLM itself and how it’s configured to work without ray.
I figured it out. Using eugr’s launch commands for vllm as a guide, I first start the worker node (Spark #2) using these options:
--distributed-executor-backend mp \
--nnodes 2 --node-rank 1 --master-addr $HEAD_IP_ADDR \
--master-port $MASTER_PORT --headless
And then start the head node (Spark #1) using this:
--distributed-executor-backend mp \
--nnodes 2 --node-rank 0 --master-addr $HEAD_IP_ADDR \
--master-port $MASTER_PORT
This seems to be working okay, no ray needed. I also seem to be getting a bit better tokens/sec this way, but I don’t know if that’s due to ditching ray or possibly improvements to vllm since the last time I benchmarked.
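For anyone who wants to script this, here is a minimal sketch that builds the per-node command from the options above. The flags are the ones I listed; the `vllm_cmd` helper name and the default values for `MODEL`, `HEAD_IP_ADDR`, and `MASTER_PORT` are placeholders I made up for illustration, and any other options you normally pass to `vllm serve` are left out. It only prints the command, so you can check it before running it.

```shell
# Placeholder defaults (hypothetical values, override them for your setup).
MODEL="${MODEL:-your-org/your-model}"
HEAD_IP_ADDR="${HEAD_IP_ADDR:-192.168.100.10}"
MASTER_PORT="${MASTER_PORT:-29501}"

# Print the vllm command for one node: rank 0 = head (Spark #1), rank 1 = worker (Spark #2).
vllm_cmd() {
  rank="$1"
  cmd="vllm serve $MODEL --distributed-executor-backend mp"
  cmd="$cmd --nnodes 2 --node-rank $rank"
  cmd="$cmd --master-addr $HEAD_IP_ADDR --master-port $MASTER_PORT"
  # Only the worker gets --headless; the head node is started without it.
  if [ "$rank" -ne 0 ]; then
    cmd="$cmd --headless"
  fi
  printf '%s\n' "$cmd"
}
```

Run `vllm_cmd 1` on Spark #2 first and `vllm_cmd 0` on Spark #1 afterwards, matching the start order above, then paste or `eval` the printed command once it looks right.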
You can also let it handle both nodes with one command: ./launch-cluster.sh --no-ray exec vllm serve .... I’ve been using this setup ever since it was launched and, with only 2 nodes, I don’t miss Ray.
Right, that’s if you’re using eugr’s Docker image, but I’m running things manually, so I’m not using that. However, getting rid of Ray really simplifies the process: it’s basically just keeping the packages in sync between the two Sparks and launching vllm once per node. It’s working great now, and with slightly better performance (29 t/s with Qwen3.5-397B in 4-bit AutoRound).
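A rough way to check that the packages really are in sync, assuming both environments are pip-managed and you can SSH from one Spark to the other (both are assumptions, and `spark2` is a hypothetical hostname). The `freeze_diff` helper is my own name for this sketch; it prints the lines that appear in only one of two `pip freeze` listings, so empty output means the listings match.

```shell
# Print package lines that appear in only one of two `pip freeze` listings.
# Lines unique to the second file are indented with a tab (standard `comm -3` output).
freeze_diff() {
  a_sorted="$(mktemp)"
  b_sorted="$(mktemp)"
  sort "$1" > "$a_sorted"
  sort "$2" > "$b_sorted"
  comm -3 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}

# Example (hypothetical hostname and paths):
#   pip freeze > /tmp/spark1.txt
#   ssh spark2 pip freeze > /tmp/spark2.txt
#   freeze_diff /tmp/spark1.txt /tmp/spark2.txt
```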