FTR: tested against local build 7108 (7d77f0732) of llama.cpp - getting 14 t/s. Not that much of an improvement.
Also tested against a current build of vLLM (0.11.2.dev188+g730bd3537.d20251122) with an AWQ quant (4-bit, cpatonn/Magistral-Small-2509-AWQ-4bit) - getting up to 16 t/s.
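For reference, a vLLM invocation along these lines should reproduce that setup. This is just a sketch - vLLM auto-detects the AWQ quantization from the model config, and the `--max-model-len` / `--gpu-memory-utilization` values here are illustrative assumptions, not the settings behind the 16 t/s number:

```shell
# Sketch: serve the 4-bit AWQ quant of Magistral Small with vLLM.
# The context length and memory fraction below are placeholder values;
# tune them for your own box.
vllm serve cpatonn/Magistral-Small-2509-AWQ-4bit \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
```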
My box only arrived this week, as Germany was late for the release, so I was curious - everyone else seems to prefer testing against MoEs, which look like they were made for this kind of box. Mistral did some MoEs in the past; I hope they release new ones, too.
@ululusem How did you get the LM Studio AppImage to work? I gave up on that and have been running llama.cpp or node-llama-cpp, which gives 14 tps. vLLM in Docker is about the same. After I saw your post, I tried the newest download of LM Studio, but it is still broken on DGX Spark.
llama_perf_sampler_print: sampling time = 0.73 ms / 19 runs ( 0.04 ms per token, 26027.40 tokens per second)
llama_perf_context_print: load time = 5256.31 ms
llama_perf_context_print: prompt eval time = 257.97 ms / 20 tokens ( 12.90 ms per token, 77.53 tokens per second)
llama_perf_context_print: eval time = 61662.61 ms / 884 runs ( 69.75 ms per token, 14.34 tokens per second)
llama_perf_context_print: total time = 67856.63 ms / 904 tokens
llama_perf_context_print: graphs reused = 880
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - CUDA0 (GB10) | 122570 = 82405 + (14208 = 13302 + 640 + 266) + 25956 |
llama_memory_breakdown_print: | - Host | 378 = 360 + 0 + 18 |
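For what it's worth, the t/s figures in that log are just tokens divided by wall time; a quick sanity check of the eval line (884 runs in 61662.61 ms):

```shell
# Reproduce the 14.34 t/s figure from the eval line above:
# 884 tokens generated over 61662.61 ms of eval time.
awk 'BEGIN { printf "%.2f tokens per second\n", 884 * 1000 / 61662.61 }'
```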
Go to this one https://lmstudio.ai/download - make sure you select Linux - ARM64 (any recent version should work, but I'm running 0.3.30). Then you can run it like this: ./LM-Studio-0.3.30-2-arm64.AppImage --no-sandbox. This should work.
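In case it helps, the full sequence would look something like this (the filename depends on the version you download, so treat it as a placeholder):

```shell
# Sketch: make the downloaded AppImage executable, then launch it
# with the sandbox disabled, as described above.
chmod +x ./LM-Studio-0.3.30-2-arm64.AppImage
./LM-Studio-0.3.30-2-arm64.AppImage --no-sandbox
```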
Also, can you post your args for 14 tps?
Your original post said “I’m running mistralai/magistral-small-2509 in LM Studio (but I tried with llama.cpp directly too) and basically best I can do is 10 tok/sec.”
I am saying LM Studio never really worked for me, and that's the error I got from the log when I tried to load mistralai/magistral-small-2509 or any other model, with the settings you posted or anything else. But it's OK if I can't get LM Studio to work, because llama.cpp works fine and gives a better TPS rate, and that's what I prefer anyway. I will mostly be using the DGX Spark headless, so the terminal and a WebUI are fine; GUI apps like LM Studio are not. I was just curious how you got LM Studio working. And yes, my llama.cpp build is the latest - I built it from source two days ago.
Basically, that error code means LM Studio doesn't like the settings (a very large context length that overflows the VRAM is the usual cause). Also, LM Studio downloads its own build of llama.cpp, so checking the settings there and seeing whether they pass its checks helps too. Just FYI.
I'm also using it headless (I run models in server mode and connect to them via the LM Studio API), but yeah, a self-built llama.cpp gives better speed… which is weird, because LM Studio uses it as the back-end to run models…
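For a fully headless llama.cpp setup, something along these lines should work - a sketch where the GGUF path, `-ngl` value, and port are placeholders for your own setup; `llama-server` exposes an OpenAI-compatible HTTP API:

```shell
# Sketch: run llama.cpp's built-in server headless.
# Model path and -ngl (GPU layers) are placeholders, not tested values.
llama-server -m ./magistral-small-2509-Q4_K_M.gguf -ngl 99 --port 8080 &

# Once it's up, any OpenAI-compatible client (or plain curl) can query it:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```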