Hey GB10 community,
I’m running a DGX Spark (128GB) and also have an RTX 5090 (32GB) on my LAN via 10GbE. Looking to optimize for agentic coding workflows. Would love community input on what’s working in January 2026.
My Use Case: Orchestrator-Workers Pattern
I’m building a local orchestrator-workers agentic system for coding tasks:
The orchestrator dynamically determines subtasks and spawns workers to execute in parallel—ideal for multi-file refactors where scope is unpredictable.
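To make the pattern concrete, here’s a minimal sketch of what I mean, assuming an OpenAI-compatible server (llama.cpp or vLLM) on the Spark; the base URL and model names below are placeholders:

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder endpoint: a llama.cpp / vLLM server on the Spark exposing the OpenAI-compatible API.
client = AsyncOpenAI(base_url="http://spark.local:8000/v1", api_key="none")

async def plan(task: str) -> list[str]:
    """Orchestrator call: decompose the task into independent file-level subtasks."""
    resp = await client.chat.completions.create(
        model="orchestrator-model",  # placeholder
        messages=[{"role": "user", "content": f"Split into independent subtasks, one per line:\n{task}"}],
    )
    return [s for s in resp.choices[0].message.content.splitlines() if s.strip()]

async def work(subtask: str) -> str:
    """Worker call: execute one subtask (file-level edit, test fix, etc.)."""
    resp = await client.chat.completions.create(
        model="worker-coder-model",  # placeholder
        messages=[{"role": "user", "content": subtask}],
    )
    return resp.choices[0].message.content

async def run(task: str) -> list[str]:
    subtasks = await plan(task)
    # Fan out: workers run concurrently; the server batches the parallel requests.
    return await asyncio.gather(*(work(s) for s in subtasks))

if __name__ == "__main__":
    results = asyncio.run(run("Rename util.get_cfg to load_config across the repo"))
    print("\n---\n".join(results))
```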
Hardware Complementarity
| Device | Memory | Bandwidth | Strength |
|---|---|---|---|
| DGX Spark (GB10) | 128GB | 273 GB/s | Large models, long context |
| RTX 5090 | 32GB | 1,792 GB/s | Raw speed (~6.5x faster decode) |
This suggests: Orchestrator on Spark (needs context/memory) + Fast workers on 5090 (benefits from speed)?
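If that split holds, the only change to the sketch above is routing by role, i.e. two clients instead of one. Hostnames here are placeholders:

```python
from openai import AsyncOpenAI

# Hypothetical endpoints: orchestrator served on the Spark, workers on the 5090 box.
ORCHESTRATOR = AsyncOpenAI(base_url="http://spark.local:8000/v1", api_key="none")    # big model, long context
WORKERS      = AsyncOpenAI(base_url="http://rtx5090.local:8000/v1", api_key="none")  # small coder model, fast decode
```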
Question: Hybrid GB10 + 5090 Clustering?
I’ve seen a few approaches for distributed inference across heterogeneous hardware:
- EXO Combines DGX Spark and Mac Studio to Accelerate Large Language Model Inference — Demonstrated DGX Spark + Mac Studio via 10GbE, using a disaggregated prefill/decode pipeline; achieved a 2.8x speedup. Experimental, but designed for heterogeneous clusters.
- Distributed Inference and RPC | ggml-org/llama.cpp | DeepWiki — Built-in distributed inference over TCP. Run rpc-server on each GPU, connect via the --rpc flag. Backend-agnostic (CUDA ↔ ROCm tested). 10GbE should work well (~48 t/s reported on gigabit). Rough launch sketch after this list.
- vLLM + Ray — Designed more for homogeneous clusters. Docs recommend containers to “hide host heterogeneity” rather than exploit it.
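For the llama.cpp RPC route, my (untested) understanding from the docs is: run rpc-server next to the 5090, then point llama-server on the Spark at it with --rpc. A launch sketch in Python, with placeholder binary paths, hostnames, ports, and model file:

```python
import subprocess

# Untested sketch based on llama.cpp's RPC docs; paths, hostnames, and the GGUF below are placeholders.

def start_rpc_backend():
    """Run on the 5090 box: expose its GPU to the LAN as a llama.cpp RPC backend."""
    return subprocess.Popen(
        ["./build/bin/rpc-server", "--host", "0.0.0.0", "--port", "50052"]
    )

def start_server_on_spark():
    """Run on the Spark: serve the model, offloading layers to the remote 5090 via --rpc."""
    return subprocess.Popen([
        "./build/bin/llama-server",
        "-m", "models/worker-coder.gguf",  # placeholder GGUF
        "--rpc", "rtx5090.local:50052",    # comma-separated host:port list of remote backends
        "-ngl", "99",                      # offload all layers across local + RPC devices
        "--port", "8000",                  # OpenAI-compatible endpoint the orchestrator hits
    ])
```

My assumption is that the 5090 ends up holding a slice of the layers, so the 10GbE hop sits inside the token loop, which is exactly the latency overhead I’m asking about below.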
Has anyone successfully combined a Spark + a discrete GPU (5090/4090/etc.) over the network? Which framework worked? What was the latency overhead vs. a single device?
Framework & Model Questions
- Framework for orchestrator-workers on a single Spark: vLLM for parallel worker batching, or llama.cpp for low-latency orchestrator calls?
- Best AWQ models for 128GB (with KV cache headroom):
  - Orchestrator: DeepSeek-V3.2 AWQ? Qwen3-30B? Best for task decomposition + tool-calling?
  - Workers: Qwen3-Coder smaller variants? Optimized for file-level edits?
- If hybrid works: Could I run orchestrator on Spark (large context) and offload fast code-gen workers to the 5090?
- AWQ vs NVFP4 in 2026: Has Blackwell NVFP4 support matured, or is AWQ still the production default?
- Context window reality: What’s the practical max before throughput tanks? 32K? 64K+?
My Setup:
- DGX Spark (128GB unified, 273 GB/s) — primary, 10GbE
- RTX 5090 (32GB GDDR7, 1.8 TB/s) — secondary, 10GbE
Anyone running hybrid setups or orchestrator-workers patterns locally? Curious what’s working.
The next step would be training very small models on the Spark to specialize in this kind of workflow.
Thanks!
