DGX Spark + RTX 6000 Pro Blackwell — disaggregated inference

EXO Labs showed the pattern works: 2× DGX Spark for prefill + an M3 Ultra Mac Studio for decode → ~2.8× end-to-end speedup on Llama-3.1 8B with an 8K prompt, over plain 10 GbE. The Spark is compute-strong / bandwidth-weak; the Mac Studio's wide memory bus eats decode.

RTX 6000 Pro Blackwell looks like a strictly stronger decode partner than the Mac Studio. On paper:

  • Memory bandwidth: ~1.79 TB/s vs M3 Ultra ~800 GB/s

  • Capacity: 96 GB GDDR7 — fits 70B-class FP8, headroom for KV at long context

  • Same silicon family as Spark (Blackwell) — NVFP4 native on both sides, no MLX<->CUDA boundary

  • Interconnect: Spark’s ConnectX-7 200 Gbps QSFP straight into the 6000 Pro host, vs EXO’s 10 GbE

  • Software stack: all CUDA — vLLM / SGLang already do PD-disagg on H100 + GB200, should drop down
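A quick sanity check on the decode side: at batch 1, decode is memory-bandwidth-bound, so an upper bound on tokens/s is roughly bandwidth divided by bytes read per token (about the weight size). A back-of-envelope sketch using the paper specs above; numbers are illustrative, not measured:

```python
# Batch-1 decode roofline: tokens/s <= bandwidth / bytes-per-token,
# where bytes-per-token ~ model weight size (every weight is read once).
# Specs are the on-paper numbers from the list above, not benchmarks.

WEIGHTS_GB = 70  # ~70B params at FP8, 1 byte/param

for name, bw_gb_s in [("RTX 6000 Pro Blackwell", 1790),
                      ("M3 Ultra", 800)]:
    tok_s = bw_gb_s / WEIGHTS_GB
    print(f"{name}: ~{tok_s:.0f} tok/s upper bound at batch 1")
```

Roughly a 2.2× decode ceiling over the M3 Ultra for a 70B FP8 model, before any interconnect or software overheads.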

So the question: has anyone actually wired Spark prefill → RTX 6000 Pro decode?

I think I saw a comment on this board saying that no one has actually reproduced this setup yet.

There is a project (more of a proof of concept, though) for speculative decoding that supports running the drafter on a separate machine. You need high bandwidth between the units to make it work.
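For a sense of why bandwidth matters there: if the verifier returns full logits for each draft position (a design some proofs of concept use; I'm assuming that here, plus Llama-3.1's ~128K vocabulary), the per-round payload is vocab-sized per draft token:

```python
# Per-round payload for a remote drafter, assuming the target ships back
# full fp16 logits for every draft position. Illustrative numbers only.

VOCAB = 128_256   # Llama-3.1 vocabulary size (assumed)
K = 8             # draft tokens per round (assumed)
BYTES = 2         # fp16 logits

payload_mb = K * VOCAB * BYTES / 1e6
for name, gb_s in [("10 GbE", 1.25), ("200 Gb/s ConnectX-7", 25.0)]:
    ms = payload_mb / 1e3 / gb_s * 1e3
    print(f"{name}: {payload_mb:.2f} MB/round -> {ms:.2f} ms transfer")
```

At 10 GbE that transfer time is on the order of a verification pass, which is why the fast link (or shipping only token IDs) matters.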

It would be really interesting if the Spark supported an eGPU, but none of the ports are Thunderbolt.

A ConnectX-7 in the RTX PRO 6000 machine would be required, and you’d be working off the edge of the map based on a research proof of concept.

Yes, it’s documented here: Distributed inference cluster: DGX Spark – RTX 6000 Pro – DevQuasar

So DevQuasar was using TP, not EXO's DP.

DP would let the Sparks aggregate the prefill (compute-bound; the Sparks have a ton of FLOPs collectively), while the 6000 Pro handles decode alone (bandwidth-bound; that plays to its 1.79 TB/s).
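To sanity-check the handoff cost of that split: in PD-disagg the prefill node ships the prompt's KV cache to the decode node once per request. A rough estimate assuming Llama-3.1-70B GQA dimensions (80 layers, 8 KV heads, head_dim 128) and FP8 KV over the 200 Gbps link:

```python
# One-time KV-cache handoff for PD disaggregation: prefill node ships the
# prompt KV, then decode runs locally on the 6000 Pro.
# Dims assumed: Llama-3.1-70B (80 layers, 8 KV heads, head_dim 128), FP8 KV.

LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
SEQ = 8192            # 8K prompt, matching the EXO numbers above
BYTES = 1             # FP8 KV cache

kv_gb = 2 * LAYERS * KV_HEADS * HEAD_DIM * SEQ * BYTES / 1e9  # K and V
link_gb_s = 200 / 8   # 200 Gb/s ConnectX-7 ~ 25 GB/s
print(f"KV: {kv_gb:.2f} GB -> {kv_gb / link_gb_s * 1e3:.0f} ms over the link")
```

Tens of milliseconds per request, amortized over the whole decode; over 10 GbE the same transfer would take over a second, which is part of why EXO's 2.8× over 10 GbE is encouraging for this setup.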

I posted that. And, to be fair, one guy with a YouTube channel did it very recently, but it's not straightforward. The guy who showed it working is on this forum as well (Alex @alexander.ziskind). They (EXO) certainly have never released it, although they may in the future.
It seems that network latency is an issue because Macs don't have PCIe NICs. I'm not sure how Alex got around that with the Thunderbolt-to-Mellanox enclosures, but maybe the latency is OK at high enough bandwidth; he seemed to do fine with 40 GbE. I have those cards lying around, plus Thunderbolt enclosures (albeit TB3/4) and a Mac Studio (M2 Ultra), so I'm looking to try this. But so far it's theoretical, with few if any true examples, and I'm not even sure you can do it without higher-speed Thunderbolt (he used an M3 Ultra, which carries TB5 ports).

Seems odd; doesn't the RTX 6000 have way more compute than 2× Spark? (But even so, wouldn't you need to pair 1× Spark with 1× RTX 6000?) It doesn't feel like you'd get any performance uplift at all.

I benchmarked 2× RTX 6000 vs 2× Spark on MiniMax M2.7 AWQ-4bit: GPU Benchmark Comparison