Seeing on this forum what sort of performance is possible for mid-sized models at good quantizations made me start out with two Asus GB10 systems.
2 x Asus GX10. Thinking of adding two more, but the 2 → 4 cost is quite steep!
Got 2 emails within hours of each other a month back, from Z.ai and GitHub Copilot, about changes to T&C and usage limits. At the same time, my Claude sub started hitting limits in half the time it used to. With so much changing among AI inference providers, it just made sense to start with this platform.
Current - 2 Asus GX-10 (with ConnectX-7).
Current - 2 DGX Sparks (with ConnectX-7)
We've grown to six 8x clusters along with a couple more GB10s for testing software installation on. We're using CRS804 switches. They've been great for running huge models on for agentic use cases. I think I can safely say that if anyone has a use case for running the big open-source models 24/7, they can't go wrong with one or more 8x clusters. It's made local AI actually make sense for the first time compared with the large clouds and there's simply not a viable alternative for that other than the Sparks.
@ash.x.kingsley You have to tell us more
@sjug, sure, just ask any questions and I'll try my best to answer within any limits of what I can discuss about it. We mostly use them for agentic programming, but also for various odds and ends like documentation, report generation, or automated build and test pipelines. We're very grateful to the amazing open source community that's building up around the Sparks and that's a significant reason we felt comfortable getting into this. It's gone so well though, there's no looking back. We're really hopeful that ever larger models will continue to be released open source and see the utility of the Sparks only increasing going forward.
4. Started this last fall with an Asus GX10, then got an HP ZGX on a deal, and just added 2x Acer GN100s.
@aolsucks42, what's been your go-to model on your 4x cluster?
How do you get useful performance out of them? Or are the workloads less interactive where the response times are less important?
Why the spark over RTX PRO 6000s or even H200s?
Do you run large models like GLM or Kimi? Is it all 8x running a single model?
What are you doing for 200G switching for such large clusters?
@sjug, well in our view anything ~10+ tokens per second is fine even for interactive use cases, but yeah, almost all of our use is having agents running in the background 24/7. Live interaction isn't happening all the time and when it does the agent has already responded so there's no slowdown there. They're running 24/7 so it's even less of a concern after hours or on weekends or holidays. Yes, sometimes an agent gets stuck and the human can become the blocker because they're sleeping or whatever, but that's fine. I honestly don't personally understand the demand of having the absolute highest TPS possible and think it's a good tradeoff for running the biggest models, especially since the cluster increases concurrency so there can be multiple agents running on the cluster simultaneously, which increases total TPS when that's factored in.
I think the DGX Spark is at a very specific place in the landscape of options. Beyond a hobbyist experimenting with one or two, it's intended to be clustered and run the biggest models. There isn't an alternative for it. I'd say the closest alternative would be a 4x cluster of Mac Studios with either 256GB or 512GB of RAM each, but those aren't even being sold now and have a number of tradeoffs compared to using NVIDIA hardware.
Discrete GPU-based solutions are much more expensive, use multiple times more electricity, and consequently produce much more heat. We're in a relatively modest office setting that already has a lot of other equipment in it, so we can't really have a room like an AI data center here. Anyone running them at home would be even more in that situation, often to the point where a Spark cluster would be their only choice, period, at least while Mac Studios aren't available.
Other than that, with the Sparks we were able to grow it very gradually, testing along the way so we didn't end up making incorrect assumptions about a new large expense. We were also able to buy them without needing to go through often annoying specialized sales channels. Being able to buy hardware at retail is much more convenient and it's not like there are discounts to be had anyway due to the AI demand.
We've been evaluating a lot of different models during this journey. Lately we've been using DeepSeek-V4-Pro, Kimi-K2.6, Kimi-K2.7-Code, MiniMax-M3, and GLM-5.2-FP8. Usually each 8x cluster is running a different model. Maybe we'll converge on one dominant model at some point, but we haven't accumulated enough experience to be able to declare a hands-down winner yet. Perhaps that could be GLM-5.2-FP8, but we haven't had much time with it yet.
As for the switches, it's simple: each 8x cluster has its own MikroTik CRS804 switch with 400G to 2x200G breakout DAC cables.
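For anyone sizing a similar setup, here is a rough sketch of the port arithmetic behind that layout. It assumes one 200G link per node and two 200G links per 400G switch port via the breakout cable; the function name is made up for illustration, and you should check the port count of your own switch against the result.

```python
import math

def switch_ports_needed(nodes: int, links_per_port: int = 2) -> int:
    """Number of 400G switch ports needed when each port is split
    into `links_per_port` 200G links, one link per node (assumption)."""
    return math.ceil(nodes / links_per_port)

nodes_per_cluster = 8  # from the post: 8x clusters
ports = switch_ports_needed(nodes_per_cluster)
print(f"{nodes_per_cluster} nodes need {ports} x 400G ports "
      f"({ports * 400}G of switch-side capacity)")
```

With 8 nodes and 2-way breakout this comes to 4 ports, so an odd node count would leave half of one breakout cable unused.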
Overall, I don't see any way we could have the total 6TB of memory capacity for LLMs like this any other way. Sure, it could be done using GPUs or Macs too, but that's not easy with the supply issues. I think the Sparks will be unavailable in due time too. They're already heading to $6,000+. My only major complaint would be that we didn't see this coming and get started late last year, but at least we got in before most of the price increases.
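As a quick sanity check on the 6TB figure, here is a back-of-the-envelope sketch. It assumes 128 GB of unified memory per GB10 node and counts only the six 8x clusters, not the extra test machines; the names are just for illustration.

```python
CLUSTERS = 6            # from the post: six 8x clusters
NODES_PER_CLUSTER = 8
GB_PER_NODE = 128       # assumption: 128 GB unified memory per GB10 node

def total_memory_gb(clusters: int, nodes_per_cluster: int, gb_per_node: int) -> int:
    """Total unified memory across all clustered nodes, in GB."""
    return clusters * nodes_per_cluster * gb_per_node

total_gb = total_memory_gb(CLUSTERS, NODES_PER_CLUSTER, GB_PER_NODE)
print(f"{CLUSTERS * NODES_PER_CLUSTER} nodes, {total_gb} GB total "
      f"(about {total_gb / 1024:.0f} TB)")
```

Note that this is raw capacity; the OS, KV cache, and runtime overhead all come out of the same pool, so the space left for weights is smaller.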
I have 2. It's more than enough for what I want to do, tbh. I think 2 or 4 is the sweet spot. Love these little things.
Starting with 4, most likely ending with 4, but who knows. Cheapest GX10 1TB versions.
Just one GX10. Two is becoming a possibility. But I wouldn't go beyond that. I think.
I have two Asus GX-10 with CX7 interconnect
Same as robert287
I also have an 8x cluster. Can you share the GLM 5.2 YAML and Dockerfile/mods, if it needs a custom build, please? Also, what speed are you getting on the models you mentioned? I am looking for the best coding agent for an 8x cluster that can achieve above 20 t/s at 100k tokens. I have MiniMax M3 meeting this criterion and I hope I can get the Mimo Pro dflash version running, but no luck so far.
Tower of babel:
1 HP ZGX 4TB
1 GX10 4TB
1 AMD 395+ AI 128gb => Oculink to 4060Ti, USB4 to RTX 3050
1 Mac Studio 192gb M2 ultra
(not pictured: 1 gaming PC turned AI PC, 64GB DDR5, RTX4000 Pro Blackwell, oculink → 5070Ti)
Every computer has its use and model loaded:
Spark cluster: Deepseek V4 Flash (Hermes main model)
Mac Studio: Deepseek V4 Flash via Dwarfstar (currently trying out Rose22/openlumara: AI agent framework, written from scratch (not based on openclaw), focused on stripping it down to the bare necessities, optimizing token count, reducing security risks; modular so you can enable only exactly what you need)
Strix: Qwen3.6-35B high concurrency cluster (3 instances with 4 concurrencies, aggregate 300toks/s) for Feynman
PC: Qwen-27b, Gemma4-12b for Claude-Code
→ Strix also does: Embedding, reranking, voice (NPU driven), hindsight memory recall, and is the hub for all agents
If I had another 8500, I would get another 2 sparks from bestbuy.
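To put the Strix figure above in per-request terms, here is a small sketch that divides the aggregate rate across the concurrent streams. It assumes throughput splits evenly across all streams, which real serving stacks only approximate; the function name is made up for illustration.

```python
def per_stream_tps(aggregate_tps: float, instances: int, concurrency: int) -> float:
    """Average tokens/s seen by one request, assuming an even split
    across every concurrent stream (a simplifying assumption)."""
    return aggregate_tps / (instances * concurrency)

# Figures from the post: 3 instances, 4 concurrent requests each, 300 tok/s total.
streams = 3 * 4
rate = per_stream_tps(300, instances=3, concurrency=4)
print(f"{streams} streams at roughly {rate:.0f} tok/s each")
```

The same arithmetic run in reverse is why a cluster that feels slow for one interactive user can still deliver a lot of total work when several agents run at once.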




