No offense taken. It's not my recipe; I just wanted to share it because it was buried in the main thread amid many discussions and experiments. We also hit a few small hurdles at the start, fixed them, and shared the fixes above. Could it be better? Absolutely. I would love better context handling. Is it usable for work as it is? Yes.
Hey guys. Been busy at work. Will push branch and new image on Sunday when I am back in town.
My fix was to apply a PR that fixed some metadata accumulation. It should help.
Was going to task Fable with this but as you may have heard…lol
The man of the hour himself! Thank you, Good Sir, enjoy your weekend. Whenever you drop it, it will be greatly appreciated!
I’m running into what seems like a basic config issue, but I can’t see what I’m doing wrong.
I set the NICs in the compose file:
NCCL_IB_HCA: "rocep1s0f0,roceP2p1s0f0"
NCCL_SOCKET_IFNAME: "enp1s0f0np0,enP2p1s0f0np0"
But I’m getting an error for an interface I don’t use. Is “enP7s7” hardcoded somewhere?
(Worker pid=31) ERROR 06-13 21:25:05 [multiproc_executor.py:870] RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:84] ifa != nullptr. Unable to find address for: enP7s7
Update: I was able to address this by adding the following to the compose environment section:
GLOO_SOCKET_IFNAME: enp1s0f0np0
TP_SOCKET_IFNAME: enp1s0f0np0
not sure if you need both
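For anyone hitting the same Gloo error, one way to sanity-check interface names before putting them in the compose file is to compare them against what the kernel reports inside the container. This is only a minimal sketch: `missing_interfaces` and `local_interfaces` are hypothetical helper names (not part of vLLM, NCCL, or Gloo), and the interface names are simply the ones quoted in this post.

```python
import socket


def missing_interfaces(requested, available):
    """Return the requested interface names that are not in `available`."""
    available = set(available)
    return [name for name in requested if name not in available]


def local_interfaces():
    """Names of the network interfaces visible to this process (Unix only)."""
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    # Interface names taken from the post above; replace with your own.
    wanted = ["enp1s0f0np0", "enP2p1s0f0np0", "enP7s7"]
    for name in missing_interfaces(wanted, local_interfaces()):
        print(f"not found on this host: {name}")
```

Run it inside the container, since the interfaces visible there can differ from the host depending on the network mode.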
Author just commented a couple of posts back in this thread. Looked at the recipe and it’s exactly what the original implementation has. Just like them and the others who say it works…it works for me. Running as main model now, it is excellent to have a model used by most in the cloud as a local agent driver.
Now if we can get Minimax M3 going…only minus for DS4F is the lack of a vision tower :)
It’s identical in essence; she went the extra mile and set up a standalone docker compose file as a GitHub repo for an easy clone instead of mkdir + cat. Kudos to her, less opportunity to mess up.
Unless you use vision non-stop, most harnesses let you route vision separately, so a cloud vision model can be used from time to time. That solves the problem. Or, if you have a gaming card in your workstation, a small vision model can run there.
Qwen 3.5 .8 MLX is around 1-2 GB, I think, and it does great captioning of photos.
Yeah, you can easily run it alongside the main model on one of the Sparks.
Shout out to all on this topic. Very good effort and outcome. Thanks @MiaAI_Lab for the git, it allowed me the brief Sunday coffee time to enable the model here without too much pre-wake-up thinking. Hermes Agent + DSv4F is pretty good, fits an interaction I was missing. Thanks @aidendle94 and @0rand for the kick-off on all this, I almost dismissed this model due to lack of attention left in my wet brain :D
The thing is that the recipe you posted here was modified by the Forum somehow.
e.g., VLLM_NCCL_SO_PATH: /opt/env/lib/python3.12/…/libnccl.so.2
Yeah, but it’s best to check these things locally; you may have it installed differently:
find / -name "libnccl.so.2"
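If the container image has no `find`, the same search can be done from Python. A rough sketch, assuming nothing about where the library lives: `find_library` is a hypothetical helper, and `/opt` is just an example starting directory.

```python
import os


def find_library(root, filename="libnccl.so.2"):
    """Yield every path under `root` whose file name equals `filename`."""
    # os.walk skips directories it cannot read (default onerror=None).
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            yield os.path.join(dirpath, filename)


if __name__ == "__main__":
    # "/opt" is only an example starting point; use "/" for a full scan.
    for path in find_library("/opt"):
        print(path)
```

Whatever path it prints is the one to put in VLLM_NCCL_SO_PATH for your own install.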
Are you running mlx on the arm/cuda environment?
Minor update
aidendle94/sparkrun-vllm-ds4-gb10:production-v2 with [Bugfix] Fix linear host RSS growth under sustained classification load with prefix caching (V1) by Oxygen56 · Pull Request #44237 · vllm-project/vllm
edit:
Will push code tomorrow to github. Sleepy time
Thanks for this and have a good night.
I tested this and it seems to work. KV cache clears periodically. I let codex hammer it with 8 concurrent requests with ~455k-token unique prompts to force prefix-cache eviction pressure.
Observed behavior:
- KV usage did not monotonically climb.
- Under pressure, KV showed repeated eviction drops, for example:
  - 0.389 → 0.334
  - 0.410 → 0.339
  - 0.411 → 0.338
- After all requests drained, with running=0 and waiting=0, KV stayed flat at idle.
- Final idle samples stayed stable at about 0.2212, and a later check stayed stable at about 0.2253.
- Prefix cache was active and not poisoned:
  - prefix_cache_queries_total ≈ 9.36M
  - prefix_cache_hits_total ≈ 4.98M
- Server-side errors/aborts stayed at 0
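To make the numbers above easier to re-check, here is a small sketch of the arithmetic. The helper names `hit_rate` and `eviction_drops` are made up for illustration, and the sample list is stitched together from the three drops quoted above rather than being the real sampling sequence.

```python
def hit_rate(hits, queries):
    """Fraction of prefix-cache queries that were hits."""
    return hits / queries if queries else 0.0


def eviction_drops(samples, min_drop=0.01):
    """Return (before, after) pairs where KV usage fell by at least `min_drop`."""
    return [
        (before, after)
        for before, after in zip(samples, samples[1:])
        if before - after >= min_drop
    ]


if __name__ == "__main__":
    # Approximate counter values quoted above.
    print(f"prefix cache hit rate: {hit_rate(4.98e6, 9.36e6):.1%}")
    # Illustrative KV usage samples built from the quoted drops.
    samples = [0.389, 0.334, 0.410, 0.339, 0.411, 0.338]
    print(eviction_drops(samples))
```

With the rounded counters this works out to a hit rate of roughly 53% over the whole run.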
May need to run it for a few days to confirm.
A big thank you is in order for your hard work. Appreciated!
Thank you very much, upgraded. Will report results when I get close to limit (if it happens)!
Anecdotal evidence only, but after an evening of heavy use the cache grew to only 45% and is going up very slowly. It usually fills much faster. Same pp and tg. The fix is working!
PS works like magic!
rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:31 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.64, Accepted throughput: 25.00 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 250 tokens, Drafted: 304 tokens, Per-position acceptance rate: 0.974, 0.671, Avg Draft acceptance rate: 82.2%
(APIServer pid=1) INFO 06-16 14:11:41 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 39.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:41 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.49, Accepted throughput: 23.60 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 236 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.886, 0.608, Avg Draft acceptance rate: 74.7%
(APIServer pid=1) INFO 06-16 14:11:51 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:11:51 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.46, Accepted throughput: 23.00 tokens/s, Drafted throughput: 31.60 tokens/s, Accepted: 230 tokens, Drafted: 316 tokens, Per-position acceptance rate: 0.892, 0.563, Avg Draft acceptance rate: 72.8%
(APIServer pid=1) INFO: 192.168.1.2:64972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 06-16 14:12:01 [loggers.py:271] Engine 000: Avg prompt throughput: 249.1 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.41, Accepted throughput: 13.50 tokens/s, Drafted throughput: 19.20 tokens/s, Accepted: 135 tokens, Drafted: 192 tokens, Per-position acceptance rate: 0.854, 0.552, Avg Draft acceptance rate: 70.3%
(APIServer pid=1) INFO 06-16 14:12:11 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 95.1%, Prefix cache hit rate: 97.5%
(APIServer pid=1) INFO 06-16 14:12:11 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 30.80 tokens/s, Accepted: 254 tokens, Drafted: 308 tokens, Per-position acceptance rate: 0.974, 0.675, Avg Draft acceptance rate: 82.5%
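For reading the SpecDecoding lines above, a small parsing sketch may help. Two relations seem to hold in these logs: the average draft acceptance rate matches Accepted / Drafted, and the mean acceptance length matches 1 plus the sum of the per-position rates. Both are inferred from the numbers in this post, not taken from the vLLM docs, and `parse_spec_metrics` and `summarize` are hypothetical helpers.

```python
import re

# Matches the tail of a "SpecDecoding metrics" log line as pasted above.
LINE_RE = re.compile(
    r"Accepted: (?P<accepted>\d+) tokens, Drafted: (?P<drafted>\d+) tokens, "
    r"Per-position acceptance rate: (?P<rates>[\d., ]+?), Avg"
)


def parse_spec_metrics(line):
    """Return (accepted, drafted, per_position_rates), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    rates = [float(r) for r in match.group("rates").split(",")]
    return int(match.group("accepted")), int(match.group("drafted")), rates


def summarize(line):
    """Return (draft acceptance rate, 1 + sum of per-position rates)."""
    accepted, drafted, rates = parse_spec_metrics(line)
    return accepted / drafted, 1 + sum(rates)


if __name__ == "__main__":
    sample = (
        "SpecDecoding metrics: Mean acceptance length: 2.64, "
        "Accepted throughput: 25.00 tokens/s, Drafted throughput: 30.40 tokens/s, "
        "Accepted: 250 tokens, Drafted: 304 tokens, "
        "Per-position acceptance rate: 0.974, 0.671, "
        "Avg Draft acceptance rate: 82.2%"
    )
    rate, length = summarize(sample)
    print(f"draft acceptance: {rate:.1%}, implied acceptance length: {length:.3f}")
```

If a future vLLM version changes the log wording, the regex will simply return None, so treat it as a throwaway helper.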
Love to hear that. For the new image, would any configuration adjustments be needed?