Best Local LLM for Ralph Loop

Hi DGX Spark-ers & GB10-ers,

I recently found some success with a ralph loop that I was running - but I found that my loop is burning ~$50 in API credits from Claude a day. I want to continue using my ralph loop but I want to it to use a local LLM on my GB10.

To that end, what is the best (balance between speed, quality and performance) LLM I should use to feed into my ralph loop? Any recommendations from the broader community of experts?

Thanks!

Who the f… is Ralph? To be honest I - didn’t heard of it before…

It is used for Coding if my quick consultation of Dr. Google is correct?

Then you should try Qwen/Qwen3-Coder-Next as FP8 running in vLLM.

To ease the (current) pain having the “right” version of vLLM, libraries etc. use: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub

Comes with batteries included:

Also promising candidate: Qwen3.5-35B-A3B-FP8

Single Spark, I assume?

1 Like

@cosinus You’re the man - appreciate it!

Let me introduce you to ralphy :) GitHub - snarktank/ralph: Ralph is an autonomous AI agent loop that runs repeatedly until all PRD items are complete. · GitHub

Finally, yes - sadly only a single GB10 (for now)…

1 Like

Thank you for the introduction. I am still way behind in regards of agent magic… to busy getting infra running or testing new models, patches vLLM versions… trying to change that. :-D

Another candidate that fits into one Spark: Intel/Qwen3.5-122B-A10B-int4-AutoRound

Intels int4 AutoRound is also very good.

In order to see what to expect in terms of speed head over to:

1 Like

Brilliant!

I can confirm - I’m running Intel/Qwen3.5-122B-A10B-int4-AutoRound for about two weeks now using it mainly for Opencode and Ralph and it is working pretty well. I’m getting consistent 25 t/s in c1. And about 40-50 t/s in c2. It works, but there is still problem with sudden stops because of tool calls ending up in the reasoning blocks. But that is something which Ralph solves pretty well, because it just loops until the PRD is done…