I recently found some success with a ralph loop I was running, but the loop is burning ~$50 a day in Claude API credits. I want to keep using my ralph loop, but I want it to use a local LLM on my GB10.
To that end, what is the best LLM (best balance of speed, quality, and performance) to feed into my ralph loop? Any recommendations from the broader community of experts?
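For context, a ralph loop is essentially a wrapper that re-runs the same prompt through an agent until the spec (PRD) is satisfied, so swapping Claude for a local model mostly means changing the agent invocation. A minimal sketch, with `run_agent` as a hypothetical stand-in for the real agent call (e.g. opencode pointed at a local OpenAI-compatible endpoint); the DONE marker and iteration cap are illustrative conventions, not part of any specific tool:

```shell
#!/bin/sh
# Minimal ralph loop sketch: re-invoke the agent until the PRD is marked done.

prd=$(mktemp)                       # stand-in for your real PRD file
echo "TODO: build the feature" > "$prd"

run_agent() {
    # Placeholder for the real agent call, e.g.:
    #   opencode run --model local/your-model "$(cat PROMPT.md)"
    echo "iteration $1"
    # Simulate the agent finishing the PRD on the third pass
    if [ "$1" -ge 3 ]; then
        echo "DONE" >> "$prd"
    fi
}

i=0
max_iters=10                        # safety cap so a stuck agent can't loop forever
until grep -q "DONE" "$prd"; do
    i=$((i + 1))
    if [ "$i" -gt "$max_iters" ]; then
        break
    fi
    run_agent "$i"
done
```

The cap matters more with a local model than with Claude: a weaker model can stall on the same step indefinitely, and without a bound the loop just burns electricity instead of API credits.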
Thank you for the introduction. I am still way behind when it comes to agent magic… too busy getting infra running, testing new models, patching vLLM versions… trying to change that. :-D
Another candidate that fits into one Spark: Intel/Qwen3.5-122B-A10B-int4-AutoRound
Intel's int4 AutoRound quantization is also very good.
To see what to expect in terms of speed, head over to:
I can confirm: I've been running Intel/Qwen3.5-122B-A10B-int4-AutoRound for about two weeks now, mainly for Opencode and Ralph, and it is working pretty well. I'm getting a consistent 25 t/s in c1 and about 40-50 t/s in c2. It works, but there is still a problem with sudden stops caused by tool calls ending up in the reasoning blocks. That's something Ralph handles pretty well, though, because it just loops until the PRD is done…