Hello everyone, please help me choose the optimal hardware for our task. We’ll be developing a multi-agent system with a multi-level orchestrator to support a team of engineers and scientists developing new products. Essentially, we want to create our own corporate AI with engineering and programming knowledge, meaning it will write code and produce CAD drawings, design documentation, and technical documentation. An AI assistant recommended choosing Granite 34B and deploying our AI on it. Our budget is limited to $5,500-$6,000, which is quite a large sum for Ukraine.
We were initially leaning toward purchasing a Spark, but after reading reviews, we’re now at a crossroads about what to buy. Dear gurus, please advise.
How large is the team? For your budget, there are no good options, really. RTX 6000 Pro would be a better option than Spark for a team, but it’s 2x the price. In this price range, the best value would be a custom build with a server motherboard and 4x RTX 3090. It will give you 96GB VRAM and much better speeds than Spark.
Spark might work for a very small team, but you can forget about running dense models on it. Granite 34B is not the best anyway, and it will run at under 10 t/s, which is just way too slow for production use.
For Spark you need sparse MoE models. GPT-OSS-120B is probably the best overall model right now, but you may need to use several or even fine-tune one for your specific needs.
Dear GURU, thank you for your reply.
There are 5 people on the team. I’m not a specialist or an expert, but how can Granite, with 34 billion parameters, perform worse than OSS, with 120 billion parameters?
Maybe use a 5080 or 5090; they’re easier to find in Ukraine.
Currently, I have a limited budget for the first hardware for the local AI. We have grant funding, and the first grant is only $12,000, which we need to use to build the first prototype of a firefighting drone. After that, we’ll receive $50,000-$100,000, and then we can buy a proper server for the local AI.
Basically, I want my own personal Jarvis, like Tony Stark from Iron Man =))
Granite is a dense model. It means that all 34B parameters participate in generating every token.
GPT-OSS is a sparse Mixture-of-Experts (MoE) model where only 5.1B parameters are active for any given token, so in terms of speed it’s on par with a ~7B dense model.
To add to what @eugr said, it comes down to how much of the model is used for each token produced. A dense model is one where every parameter has to be processed to produce a token. That means it needs to read and do math on all of those 34 billion parameters every time it produces a token.
To put it in terms of size, if you run a 4-bit quantised version of that 34B model, it needs to read 17GB of data from memory to produce 1 token. Since the DGX Spark has 273GB/s of memory bandwidth, the highest theoretical performance would be: 273 / 17 ≈ 16 tokens per second. And like I said, that’s theoretical; you’re unlikely to actually achieve that number.
An MoE model is smarter in the sense that it doesn’t need to do processing on every single parameter. Everyone uses them these days, even OpenAI, which makes ChatGPT; in fact, GPT-OSS:120B is a model made by OpenAI. So even though GPT-OSS:120B has 120 billion parameters, as @eugr said, it only needs to do processing on 5.1 billion of those parameters for each token produced. At the same 4-bit precision, that means it only needs to read 2.55GB from memory to produce each token, so theoretical max performance is: 273 / 2.55 ≈ 107 tokens per second. Again, this is just theoretical; you will always lose some performance in practice. But it should give you an idea of why you’d want to use an MoE model, especially on a memory-bandwidth-constrained device like the Spark.
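The bandwidth arithmetic above can be sketched in a few lines of Python. This is only a rough upper-bound estimate, assuming 4-bit weights and that every active weight is read from memory exactly once per token; the function name and constant are made up for illustration, and it ignores KV-cache reads, activations, and any compute limits.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_billion: float,
                          bits_per_weight: float = 4.0) -> float:
    """Theoretical decode ceiling: memory bandwidth / bytes read per token."""
    # Billions of parameters * bits per weight / 8 = gigabytes read per token.
    gb_per_token = active_params_billion * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_per_token


# DGX Spark memory bandwidth as quoted in this thread.
SPARK_BANDWIDTH_GB_S = 273.0

# Dense model: all 34B parameters are read for every token.
dense_tps = max_tokens_per_second(SPARK_BANDWIDTH_GB_S, 34.0)

# MoE model: only the 5.1B active parameters are read for every token.
moe_tps = max_tokens_per_second(SPARK_BANDWIDTH_GB_S, 5.1)

print(f"Granite 34B (dense):            {dense_tps:.0f} t/s ceiling")
print(f"GPT-OSS-120B (5.1B active MoE): {moe_tps:.0f} t/s ceiling")
```

Changing `bits_per_weight` or the bandwidth figure lets you repeat the same estimate for other quantisations or other hardware.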
Reasons to use GPT-OSS:120B over Granite 34B:
It’s a far more powerful model at 120 billion parameters.
It’s newer (Granite 34B = April 2024, GPT-OSS:120B = August 2025).
If you’re going to get a Spark, you have the memory to run a 120B model.
It’s going to be much faster than Granite 34B.
It’s made by OpenAI, and I don’t know what it is, but these guys just seem to know how to make a good model.
You have options for fine-tuning it on the Spark, such as Unsloth and Llama-Factory.
Thank you for your reply, dear GURUs. I’ll switch to the GPT OSS 120B model, as it also has an Apache 2.0 license. Claude and Gemini estimate that the performance will be more or less acceptable, at around 60 tokens/s. I still won’t risk buying a laptop or PC based on the 5090. We’ll stick with the Spark. Another question: is it possible to expand the memory to, say, 8 TB by adding another 4 TB on an external hard drive? I want it to be as smart as possible. We don’t have many people at this stage, and we only have 6 months to produce the first prototype. We already have part of the brains, namely the early fire detection system. We just need to develop the chassis and fire suppression methods to meet the stated grant specifications.
Adding more storage isn’t going to make it smarter… That depends on memory and on which models you can fit in that memory, and the Spark has 128GB. In terms of storage, 4TB is plenty. I hope you’ll be okay; reading this, I can’t help but be worried.
Thank you for your reply, dear GURU. Claude AI estimates that the size of the database containing the information that OSS 120B will handle will be approximately 3 TB. I simply wanted to maximize its engineering knowledge so that it could help us develop our products.
Thank you for your concern and for your help; I am sincerely grateful. Last night was truly restless. Normally, the MiG-31K alarm is no big deal for us, but during it, a ballistic missile was launched and damaged an electrical substation 40 km away from me, in the Odessa region, Ukraine.
That’s in llama.cpp built from the master branch. Some recent commits made it a tad slower, but I’m still getting 58 t/s.
My benchmarks are there in that github thread.
vLLM gives me ~34 t/s on a single Spark and 55 t/s on dual, and SGLang about 52 t/s on single and 75 t/s on dual.
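To make the single-versus-dual comparison easier to read, here is a small Python sketch that computes the speedup from adding a second Spark. It uses only the approximate numbers quoted in this post; the dictionary layout and the function name are made up for illustration.

```python
# Approximate throughput (tokens/s) as reported in this thread.
reported_tps = {
    "vLLM": {"single": 34.0, "dual": 55.0},
    "SGLang": {"single": 52.0, "dual": 75.0},
}


def dual_speedup(single_tps: float, dual_tps: float) -> float:
    """Ratio of dual-Spark to single-Spark throughput."""
    return dual_tps / single_tps


speedups = {
    engine: dual_speedup(tps["single"], tps["dual"])
    for engine, tps in reported_tps.items()
}

for engine, ratio in speedups.items():
    print(f"{engine}: {ratio:.2f}x with a second Spark")
```

Both ratios come out below 2x, so on these figures a second unit adds throughput but does not double it.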
No problems with overheating so far. You just need to make sure that the airflow is not obstructed. The intake is a mesh in the front and the exhaust is in the back, so there has to be enough clearance to avoid hot air buildup.
Just to follow up on this, @eugr helped me solve the problem I was having with my performance. I was running the Unsloth version of GPT-OSS:120B, and it turns out it’s a requant with worse performance than the official ggml version, which is in native mxfp4 (the format it was originally released in). I’ve now switched to ggml-org/gpt-oss-120b-GGUF and I’m getting 58 t/s, as @eugr mentioned in his earlier post.
Dear GURUs, thank you for your answers. I’m waiting for the grant funds to arrive and will then purchase a Spark. After the purchase, I’ll write back and share my impressions.
Thank you for your help, and I wish you all the best, and God bless you.