Maybe some of you have seen those videos from Chinese social media of people with shelves of Mac Minis/Studios/etc. running AI, since Nvidia hardware has been banned there (at different levels at different times). I’m just wondering what the experience is like compared to our beloved Sparkies, including on the software side.
Can’t seem to buy the 512 GB model anymore, even in the US, because they’re sold out. Performance is a bit different, but at least on a capacity level, one would be equivalent to 4 DGX Sparks for running a 4-bit quant of GLM 5.1.
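Back-of-the-envelope math for that capacity claim (a sketch; the ~400B parameter count, the 0.5 bytes/param for a 4-bit quant, and the 20% overhead factor are my assumptions, not published specs):

```python
# Rough fit check: does a 4-bit quant of a ~400B-param model fit in 512 GB?
# Assumptions (mine, not from the thread): 4-bit quant ~= 0.5 bytes/param,
# ~20% extra for KV cache and runtime buffers, 128 GB per DGX Spark.
params_b = 400                 # billions of parameters (illustrative)
weights_gb = params_b * 0.5    # 4-bit -> 0.5 bytes per parameter
total_gb = weights_gb * 1.2    # overhead fudge factor

mac_gb = 512                   # one Mac Studio, 512 GB unified memory
sparks_gb = 4 * 128            # four DGX Sparks, 128 GB each

print(f"~{total_gb:.0f} GB needed; Mac fits: {total_gb <= mac_gb}, "
      f"4x Spark fits: {total_gb <= sparks_gb}")
```

At those assumptions both setups have roughly the same headroom, which is why the "4 Sparks ≈ one 512 GB Mac" equivalence only holds at the capacity level, not in bandwidth or compute.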
So it piqued my interest and got me wondering what the developer/user experience is like standing up vLLM on those machines compared to the DGX Spark. What’s lacking on the Apple machines? What’s lacking on the DGX Spark (besides how bad the state of NVFP4 is)? Anyone have experience with both to compare?
Also, what’s the experience of running multiple Sparks like? I see a lot of posts mentioning eugr’s solution, but I’m wondering if there are other existing solutions as well.
A lot of the value of the DGX Spark is that you have the full Nvidia stack, top to bottom, on the latest chipset. On macOS you have to contend with the fact that almost every deep learning model was built on Nvidia, and MPS is an afterthought at best. Expect a lot of models to have no support and fall back to CPU, plus memory leaks and incorrect or divergent results due to branching in common libraries.
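A quick way to see where you land on a Mac (a sketch assuming PyTorch is installed; the import guard just keeps it from crashing elsewhere). `PYTORCH_ENABLE_MPS_FALLBACK=1` is PyTorch's documented switch to make unsupported ops fall back to CPU instead of raising, which is exactly the silent-slowdown behavior described above:

```python
import importlib.util
import os

# Must be set before torch is imported: ops missing from the MPS backend
# then fall back to CPU instead of raising NotImplementedError.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

device = "cpu"
if importlib.util.find_spec("torch") is not None:
    import torch
    # The mps backend only reports available on Apple Silicon builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        device = "mps"

print(f"selected device: {device}")
```

If this prints `cpu` on an Apple Silicon machine, you are in afterthought territory: the model will run, just not on the GPU.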
However, the reviews of the new M5 for tensor processing and bandwidth have been really good. Combined with Intel’s latest GPU release, and Sony saying the PS6 will have native RNN support in the game loop (most likely on an AMD APU), there might be a bit more thought put into cross-compatibility in the coming years.
Inference software for macOS is not really production-ready, i.e. there is nearly no parallel-processing support. Nice for home use, bad for small prod environments. And prompt processing on Apple is still slow. 512 GB is great, but models of that size are too heavy for the GPU anyhow. Unless you can wait, or do nightly batch processing… but even then it might just take too long, which in turn drives up energy consumption.
There is vLLM MLX (Apple Silicon).
You can run GLM-4.7 (395B) right now on 2–4 Sparks: NVFP4 on 2 Sparks (with a lot of effort) or FP8 on 4 Sparks. But not any of the newer sparse-attention models (GLM-5.x, newer DeepSeek V3.x), as there is no such attention implementation for sm121/GB10, or even for sm120/RTX Pro 6000 (96 GB VRAM).
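The rough numbers behind the 2-vs-4 Spark split (a sketch; 128 GB per Spark and the 15% overhead factor are my assumptions, the 395B parameter count and the per-format bytes/param are standard):

```python
import math

PARAMS_B = 395          # GLM-4.7 total parameters, billions
SPARK_GB = 128          # unified memory per DGX Spark (assumed)
OVERHEAD = 1.15         # KV cache / runtime buffers (assumed fudge factor)

formats = {"nvfp4": 0.5, "fp8": 1.0}   # bytes per parameter

for name, bpp in formats.items():
    total_gb = PARAMS_B * bpp * OVERHEAD
    sparks = math.ceil(total_gb / SPARK_GB)
    print(f"{name}: ~{total_gb:.0f} GB -> at least {sparks} Sparks")
```

NVFP4 lands just under 2×128 GB (hence "lots of effort"), while FP8 needs all four boxes.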
On an M3 Ultra, token generation for models with 35B active parameters (like GLM-4.7, 395B-A35B) is ~10 tokens per second.
On 4x Sparks running with tp=4: the same ~10 tokens per second.
So you get more for your money with the M3 Ultra 512 GB ($10k for 10 tps vs. a minimum of $13.6k for the Sparks), but Apple doesn’t sell the M3 Ultra 512 GB anymore. Maybe there will be an M5 Ultra 512 GB, but no one knows the new prices.
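The price/performance point works out like this (a sketch using only the numbers quoted above; the flat 10 tps for both setups is the thread's figure, not a benchmark of mine):

```python
# Dollars per token/sec at the quoted numbers: both setups hit ~10 tps
# on GLM-4.7 395B-A35B, so the cheaper box simply wins on $/tps.
setups = {
    "M3 Ultra 512GB": {"price_usd": 10_000, "tps": 10},
    "4x DGX Spark":   {"price_usd": 13_600, "tps": 10},
}

for name, s in setups.items():
    dollars_per_tps = s["price_usd"] / s["tps"]
    print(f"{name}: ${dollars_per_tps:,.0f} per token/sec")
```

At equal throughput that is a ~36% price premium for the Spark cluster, before counting the extra networking and power of running four boxes.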