Hey GB10 community,
I’m running a DGX Spark (128GB) and also have an RTX 5090 (32GB) on my LAN via 10GbE. Looking to optimize for agentic coding workflows. Would love community input on what’s working in January 2026.
My Use Case: Orchestrator-Workers Pattern
I’m building a local orchestrator-workers agentic system for coding tasks:
The orchestrator dynamically determines subtasks and spawns workers to execute in parallel—ideal for multi-file refactors where scope is unpredictable.
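To make the pattern concrete, here’s a minimal sketch of what I mean, assuming an OpenAI-compatible server (llama.cpp or vLLM) on the Spark; the base URL and model names below are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint: a llama.cpp / vLLM server on the Spark exposing the OpenAI-compatible API.
client = AsyncOpenAI(base_url="http://spark.local:8000/v1", api_key="none")

async def plan(task: str) -> list[str]:
    """Orchestrator call: decompose the task into independent file-level subtasks."""
    resp = await client.chat.completions.create(
        model="orchestrator-model",  # placeholder
        messages=[{"role": "user", "content": f"Split into independent subtasks, one per line:\n{task}"}],
    )
    return [s for s in resp.choices[0].message.content.splitlines() if s.strip()]

async def work(subtask: str) -> str:
    """Worker call: execute one subtask (file-level edit, test fix, etc.)."""
    resp = await client.chat.completions.create(
        model="worker-coder-model",  # placeholder
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.choices[0].message.content

async def run(task: str) -> list[str]:
    subtasks = await plan(task)
    # Fan out: workers run concurrently; the server batches the parallel requests.
    return await asyncio.gather(*(work(s) for s in subtasks))

if __name__ == "__main__":
    results = asyncio.run(run("Rename util.get_cfg to load_config across the repo"))
    print("\n---\n".join(results))
```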
Hardware Complementarity
| Device | Memory | Bandwidth | Strength |
|---|---|---|---|
| DGX Spark (GB10) | 128GB | 273 GB/s | Large models, long context |
| RTX 5090 | 32GB | 1,792 GB/s | Raw speed (~6.5x faster decode) |
This suggests: Orchestrator on Spark (needs context/memory) + Fast workers on 5090 (benefits from speed)?
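If that split holds, the only change to the sketch above is routing by role, i.e. two clients instead of one. Hostnames here are placeholders:

```python
from openai import AsyncOpenAI

# Hypothetical endpoints: orchestrator served on the Spark, workers on the 5090 box.
ORCHESTRATOR = AsyncOpenAI(base_url="http://spark.local:8000/v1", api_key="none")    # big model, long context
WORKERS      = AsyncOpenAI(base_url="http://rtx5090.local:8000/v1", api_key="none")  # small coder model, fast decode
```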
Question: Hybrid GB10 + 5090 Clustering?
I’ve seen a few approaches for distributed inference across heterogeneous hardware:
- EXO Combines DGX Spark and Mac Studio to Accelerate Large Language Model Inference — Demonstrated DGX Spark + Mac Studio via 10GbE, using a disaggregated prefill/decode pipeline; achieved a 2.8x speedup. Experimental, but designed for heterogeneous clusters.
- Distributed Inference and RPC | ggml-org/llama.cpp | DeepWiki — Built-in distributed inference over TCP. Run rpc-server on each GPU, connect via the --rpc flag. Backend-agnostic (CUDA ↔ ROCm tested). 10GbE should work well (~48 t/s reported on gigabit). Rough launch sketch after this list.
- vLLM + Ray — Designed more for homogeneous clusters. Docs recommend containers to “hide host heterogeneity” rather than exploit it.
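For the llama.cpp RPC route, my (untested) understanding from the docs is: run rpc-server next to the 5090, then point llama-server on the Spark at it with --rpc. A launch sketch in Python, with placeholder binary paths, hostnames, ports, and model file:

```python
import subprocess

# Untested sketch based on llama.cpp's RPC docs; paths, hostnames, and the GGUF below are placeholders.

def start_rpc_backend():
    """Run on the 5090 box: expose its GPU to the LAN as a llama.cpp RPC backend."""
    return subprocess.Popen(
        ["./build/bin/rpc-server", "--host", "0.0.0.0", "--port", "50052"]
    )

def start_server_on_spark():
    """Run on the Spark: serve the model, offloading layers to the remote 5090 via --rpc."""
    return subprocess.Popen([
        "./build/bin/llama-server",
        "-m", "models/worker-coder.gguf",  # placeholder GGUF
        "--rpc", "rtx5090.local:50052",    # comma-separated host:port list of remote backends
        "-ngl", "99",                      # offload all layers across local + RPC devices
        "--port", "8000",                  # OpenAI-compatible endpoint the orchestrator hits
    ])
```

My assumption is that the 5090 ends up holding a slice of the layers, so the 10GbE hop sits inside the token loop, which is exactly the latency overhead I’m asking about below.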
Has anyone successfully combined a Spark + a discrete GPU (5090/4090/etc.) over the network? Which framework worked? What was the latency overhead vs. a single device?
Framework & Model Questions
- Framework for orchestrator-workers on a single Spark: vLLM for parallel worker batching, or llama.cpp for low-latency orchestrator calls?
- Best AWQ models for 128GB (with KV cache headroom):
  - Orchestrator: DeepSeek-V3.2 AWQ? Qwen3-30B? Best for task decomposition + tool-calling?
  - Workers: Qwen3-Coder smaller variants? Optimized for file-level edits?
- If hybrid works: Could I run orchestrator on Spark (large context) and offload fast code-gen workers to the 5090?
- AWQ vs NVFP4 in 2026: Has Blackwell NVFP4 support matured, or is AWQ still the production default?
- Context window reality: What’s the practical max before throughput tanks? 32K? 64K+?
My Setup:
- DGX Spark (128GB unified, 273 GB/s) — primary, 10GbE
- RTX 5090 (32GB GDDR7, 1.8 TB/s) — secondary, 10GbE
Anyone running hybrid setups or orchestrator-workers patterns locally? Curious what’s working.
The next step would be training very small models on the Spark to specialize in this kind of workflow.
Thanks!
