Best Local LLM for Ralph Loop

Nemani · March 11, 2026, 4:56pm

Hi DGX Spark-ers & GB10-ers,

I recently found some success with a ralph loop that I was running - but I found that my loop is burning ~$50 in API credits from Claude a day. I want to continue using my ralph loop but I want to it to use a local LLM on my GB10.

To that end, what is the best (balance between speed, quality and performance) LLM I should use to feed into my ralph loop? Any recommendations from the broader community of experts?

Thanks!

cosinus · March 11, 2026, 5:26pm

Who the f… is Ralph? To be honest I - didn’t heard of it before…

It is used for Coding if my quick consultation of Dr. Google is correct?

Then you should try Qwen/Qwen3-Coder-Next as FP8 running in vLLM.

To ease the (current) pain having the “right” version of vLLM, libraries etc. use: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub

Comes with batteries included:

github.com/eugr/spark-vllm-docker

recipes/qwen3-coder-next-fp8.yaml

main

# Recipe: Qwen3-Coder-Next-FP8
# Qwen3-Coder-Next model in native FP8 format


recipe_version: "1"
name: Qwen3-Coder-Next-FP8
description: vLLM serving Qwen3-Coder-Next-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3-Coder-Next-FP8

#solo_only: true

# Container image to use
container: vllm-node

# Mod required to fix slowness and crash in the cluster (tracking https://github.com/vllm-project/vllm/issues/33857)
mods:
  - mods/fix-qwen3-coder-next

This file has been truncated. show original

Also promising candidate: Qwen3.5-35B-A3B-FP8

github.com/eugr/spark-vllm-docker

recipes/qwen3.5-35b-a3b-fp8.yaml

main

# Recipe: Qwen/Qwen3.5-35B-A3B-FP8
# Qwen/Qwen3.5-35B-A3B model in native FP8 format


recipe_version: "1"
name: Qwen35-35B-A3B
description: vLLM serving Qwen3.5-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.5-35B-A3B-FP8

#solo_only: true

# Container image to use
container: vllm-node

# Mod required to fix slowness and crash in the cluster (tracking https://github.com/vllm-project/vllm/issues/33857)
mods:
  - mods/fix-qwen3-coder-next
  - mods/fix-qwen3.5-chat-template

This file has been truncated. show original

Single Spark, I assume?

Nemani · March 11, 2026, 5:27pm

@cosinus You’re the man - appreciate it!

Let me introduce you to ralphy :) GitHub - snarktank/ralph: Ralph is an autonomous AI agent loop that runs repeatedly until all PRD items are complete. · GitHub

Finally, yes - sadly only a single GB10 (for now)…

cosinus · March 11, 2026, 5:35pm

Thank you for the introduction. I am still way behind in regards of agent magic… to busy getting infra running or testing new models, patches vLLM versions… trying to change that. :-D

Another candidate that fits into one Spark: Intel/Qwen3.5-122B-A10B-int4-AutoRound

Intels int4 AutoRound is also very good.

In order to see what to expect in terms of speed head over to:

Nemani · March 11, 2026, 5:41pm

Brilliant!

kernelerror · March 11, 2026, 7:41pm

I can confirm - I’m running Intel/Qwen3.5-122B-A10B-int4-AutoRound for about two weeks now using it mainly for Opencode and Ralph and it is working pretty well. I’m getting consistent 25 t/s in c1. And about 40-50 t/s in c2. It works, but there is still problem with sudden stops because of tool calls ending up in the reasoning blocks. But that is something which Ralph solves pretty well, because it just loops until the PRD is done…

Topic		Replies	Views
Implementation Guide: DGX Spark with Qwen3.5-35B-A3B via llama.cpp for Claude Code DGX Spark / GB10 Projects llama , agentic-ai	3	855	April 2, 2026
Managing Local LLM Orchestration DGX Spark / GB10 Projects	11	984	March 13, 2026
HOW-TO: Run Qwen3-Coder-Next on Spark DGX Spark / GB10 llama	92	8393	March 24, 2026
Code assist and rag (instruct) in single node DGX Spark / GB10 Projects	2	290	February 14, 2026
RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark DGX Spark / GB10 Projects llm	74	4376	April 11, 2026
Custom built vLLM + Qwen3.5-35B on NVIDIA DGX Spark (GB10) — sustained 50 tok/s, 1M context DGX Spark / GB10	17	2596	April 10, 2026
Building Local + Hybrid LLMs on DGX Spark That Outperform Top Cloud Models DGX Spark / GB10 Projects jetson , nim , llama3-70b-instruct , llama , deepseek , nemotron	19	4410	March 15, 2026
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	14522	March 24, 2026
Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table? DGX Spark / GB10 agentic-ai	40	4554	March 16, 2026
Moving from Mac to NVIDIA: bought powerful hardware, but drowning in configs DGX Spark / GB10 llama , nemotron	37	2163	February 25, 2026

Best Local LLM for Ralph Loop

Related topics