Best Inference Framework & Open Models for Orchestrator-Workers Agentic Coding on GB10 + 5090 Hybrid?

Hey GB10 community,

I’m running a DGX Spark (128GB) and also have an RTX 5090 (32GB) on my LAN over 10GbE. Looking to optimize for agentic coding workflows. Would love community input on what’s working as of January 2026.

My Use Case: Orchestrator-Workers Pattern

I’m building a local orchestrator-workers agentic system for coding tasks: the orchestrator dynamically determines subtasks and spawns workers to execute them in parallel. That makes it ideal for multi-file refactors where the scope is unpredictable.
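
For concreteness, here’s roughly the loop I mean, as a minimal sketch assuming two OpenAI-compatible endpoints (vLLM and llama.cpp’s llama-server both expose that API). The hostnames, ports, model names, and the JSON-plan prompt are all placeholders:

```python
# Minimal orchestrator-workers sketch over two OpenAI-compatible servers.
# Endpoints, model names, and the prompt format are placeholders.
import asyncio
import json

from openai import AsyncOpenAI

orchestrator = AsyncOpenAI(base_url="http://spark.local:8000/v1", api_key="none")
workers = AsyncOpenAI(base_url="http://rtx5090.local:8001/v1", api_key="none")

async def plan(task: str) -> list[str]:
    # Orchestrator (big model, long context) decomposes the task into subtasks.
    resp = await orchestrator.chat.completions.create(
        model="orchestrator-model",  # placeholder name
        messages=[{"role": "user", "content":
                   f"Split this task into independent subtasks as a JSON list:\n{task}"}],
    )
    return json.loads(resp.choices[0].message.content)  # fragile, fine for a sketch

async def work(subtask: str) -> str:
    # Worker (small, fast model on the 5090) executes one subtask.
    resp = await workers.chat.completions.create(
        model="worker-model",  # placeholder name
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.choices[0].message.content

async def run(task: str) -> list[str]:
    subtasks = await plan(task)
    # Fan out: workers are just concurrent requests against the fast box, so
    # the server's batching (vLLM continuous batching, or llama-server
    # parallel slots) provides the actual parallelism.
    return await asyncio.gather(*(work(s) for s in subtasks))

print(asyncio.run(run("Rename FooService to BarService across the repo")))
```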

Hardware Complementarity

| Device | Memory | Bandwidth | Strength |
| --- | --- | --- | --- |
| DGX Spark (GB10) | 128GB | 273 GB/s | Large models, long context |
| RTX 5090 | 32GB | 1,792 GB/s | Raw speed (~6.5x faster decode) |

This suggests a natural split: orchestrator on the Spark (needs context and memory) + fast workers on the 5090 (benefits from raw speed)?
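
That ~6.5x falls straight out of the bandwidth ratio, since single-stream decode is memory-bound (tokens/s ≈ bandwidth / bytes read per token). Quick sanity check, where the 18 GB figure is just an illustrative model footprint (roughly a 30B model at 4-bit):

```python
# Roofline estimate: decode reads (roughly) all active weights + KV cache
# once per token, so tokens/s ~= memory bandwidth / bytes per token.
spark_bw, gpu_bw = 273, 1792   # GB/s, spec-sheet numbers from the table
model_gb = 18                  # illustrative: ~30B params at 4-bit

print(f"Spark: ~{spark_bw / model_gb:.0f} t/s")   # ~15 t/s
print(f"5090:  ~{gpu_bw / model_gb:.0f} t/s")     # ~100 t/s
print(f"ratio: {gpu_bw / spark_bw:.1f}x")         # 6.6x, matching the ~6.5x claim
```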

Question: Hybrid GB10 + 5090 Clustering?

I’ve seen a few approaches for distributed inference across heterogeneous hardware:

  1. EXO Combines DGX Spark and Mac Studio to Accelerate Large Language Model Inference: demonstrated DGX Spark + Mac Studio over 10GbE using a disaggregated prefill/decode pipeline, reporting a 2.8x speedup. Experimental, but designed for heterogeneous clusters.
  2. Distributed Inference and RPC | ggml-org/llama.cpp | DeepWiki: built-in distributed inference over TCP. Run rpc-server on each GPU host and connect via the --rpc flag. Backend-agnostic (CUDA ↔ ROCm tested). 10GbE should work well (~48 t/s reported even on gigabit). Rough launch sketch after this list.
  3. vLLM + Ray: designed more for homogeneous clusters; the docs recommend containers to “hide host heterogeneity” rather than exploit it.
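
For reference, here’s what option 2 might look like on my hardware: a hypothetical launcher run from the Spark. It assumes llama.cpp is built with -DGGML_RPC=ON on both machines; hostnames and the model path are placeholders, and the flag names follow the llama.cpp RPC docs but may drift between builds:

```python
# Hypothetical launcher, run on the Spark: start the RPC backend on the
# 5090 box over SSH, then point a local llama-server at it.
import subprocess

# 1) On the 5090 box: rpc-server exposes its CUDA backend over TCP.
subprocess.Popen(
    ["ssh", "user@rtx5090.local", "rpc-server", "-H", "0.0.0.0", "-p", "50052"]
)

# 2) Locally on the Spark: llama-server splits model layers between local
#    memory and whatever backends --rpc lists (comma-separated for more).
subprocess.run([
    "llama-server",
    "-m", "model.gguf",             # placeholder model path
    "--rpc", "rtx5090.local:50052",
    "-ngl", "99",                   # offload all layers across backends
    "--port", "8080",
])
```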

Has anyone successfully combined a Spark with a discrete GPU (5090/4090/etc.) over the network? What framework worked, and what was the latency overhead vs. a single device?

Framework & Model Questions

  1. Framework for orchestrator-workers on a single Spark: vLLM for parallel worker batching, or llama.cpp for low-latency orchestrator calls? (Rough worker-batching sketch after this list.)
  2. Best AWQ models for 128GB (with KV cache headroom):
    • Orchestrator: DeepSeek-V3.2 AWQ? Qwen3-30B? Best for task decomposition + tool-calling?
    • Workers: Qwen3-Coder smaller variants? Optimized for file-level edits?
  3. If hybrid works: Could I run orchestrator on Spark (large context) and offload fast code-gen workers to the 5090?
  4. AWQ vs NVFP4 in 2026: Has Blackwell NVFP4 improved, or is AWQ still the production default?
  5. Context window reality: What’s practical max before throughput tanks? 32K? 64K+?
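
On question 1, what vLLM buys the workers is continuous batching: N file-level edits become one batched call instead of N serialized ones. Minimal sketch with vLLM’s offline API; the worker model is just an example pick:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct")  # example worker model
params = SamplingParams(temperature=0.2, max_tokens=1024)

# One prompt per subtask from the orchestrator; vLLM schedules them
# together instead of serializing them like a naive one-at-a-time loop.
subtasks = [
    "Apply this edit to src/a.py: ...",
    "Apply this edit to src/b.py: ...",
]
for out in llm.generate(subtasks, params):
    print(out.outputs[0].text)
```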

My Setup:

  • DGX Spark (128GB unified, 273 GB/s) — primary, 10GbE
  • RTX 5090 (32GB GDDR7, 1.8 TB/s) — secondary, 10GbE

Anyone running hybrid setups or orchestrator-workers patterns locally? Curious what’s working.

The next step would be training very small models on the Spark to specialize in this kind of workflow.

Thanks!

Check some options for a single Spark here: Spark Arena - LLM Leaderboard. Look for entries with high concurrency and good prompt-processing speed at 32-64K context.