Any tips on running Magistral?

I’m running mistralai/magistral-small-2509 in LM Studio (I tried with llama.cpp directly too), and the best I can do is about 10 tok/sec.

Here are my settings:
```
quantization: Q5_K_M
context length: 40k (max for this model)
GPU offload 40 (max)

CPU Thread pool: 20 (max) - I tested with 15, it’s the same

Flash attention: On with K_Cache and V_Cache matching quants: Q5_0 (for both)
```
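For reference, these settings translate roughly to a llama-server invocation like the one below. This is a sketch: the model path is a placeholder, and flag spellings vary between llama.cpp builds, so verify against `llama-server --help` (e.g. older builds take a bare `-fa` instead of `-fa on`).

```shell
# Rough llama-server equivalent of the LM Studio settings above.
# Model path and exact flag syntax are assumptions -- check --help.
llama-server \
  -m ./Magistral-Small-2509.Q5_K_M.gguf \
  -c 40960 -ngl 40 -t 20 \
  -fa on -ctk q5_0 -ctv q5_0
```

Note that the quantized KV cache (`-ctk`/`-ctv`) requires flash attention to be enabled, which matches the LM Studio settings above.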
I wonder if anyone has had luck running it faster than 10 tok/sec?

P.S. Also, please let me know if I’m posting in the wrong forum.

Since you already have a GGUF, you could use a more current build of llama.cpp instead.

llama.cpp has recently picked up some optimizations developed in cooperation between NVIDIA and the llama.cpp team.

If you prefer a UI for swapping models, have a look at llama-swap (it runs llama.cpp underneath).
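A minimal llama-swap config looks roughly like the sketch below; the model name, path, and flags here are placeholders, so check the llama-swap README for the exact schema.

```yaml
# config.yaml for llama-swap -- sketch only, names and paths are assumptions
models:
  "magistral-small":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Magistral-Small-2509.Q5_K_M.gguf
      -c 40960 -ngl 40
```

llama-swap then exposes one endpoint and starts/stops the matching llama-server process as you switch models.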

FTR: tested against a local build 7108 (7d77f0732) of llama.cpp and got 14 t/s. Not that much of an improvement.

Also tested against a current build of vLLM (0.11.2.dev188+g730bd3537.d20251122) with an AWQ quant (4-bit, cpatonn/Magistral-Small-2509-AWQ-4bit) and got up to 16 t/s.

My box only arrived this week, as Germany was late for the release, so I was curious: everybody else seems to prefer testing MoEs, which appear to be made for this kind of box. Mistral released some MoEs in the past; I hope they will release new ones, too.


@ululusem How did you get the LM Studio AppImage to work? I gave up on that and have been running llama.cpp or node-llama-cpp, which gives 14 tps. vLLM in Docker is about the same. After I saw your post, I tried the newest download of LM Studio, but it’s still broken on the DGX Spark.

```
llama_perf_sampler_print:    sampling time =       0.73 ms /    19 runs   (    0.04 ms per token, 26027.40 tokens per second)
llama_perf_context_print:        load time =    5256.31 ms
llama_perf_context_print: prompt eval time =     257.97 ms /    20 tokens (   12.90 ms per token,    77.53 tokens per second)
llama_perf_context_print:        eval time =   61662.61 ms /   884 runs   (   69.75 ms per token,    14.34 tokens per second)
llama_perf_context_print:       total time =   67856.63 ms /   904 tokens
llama_perf_context_print:    graphs reused =        880
llama_memory_breakdown_print: | memory breakdown [MiB] |  total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (GB10)       | 122570 = 82405 + (14208 = 13302 +     640 +     266) +       25956 |
llama_memory_breakdown_print: |   - Host               |                     378 =   360 +       0 +      18                |
```

Go to https://lmstudio.ai/download and make sure you select Linux → ARM64 (any recent version should work, but I’m running 0.3.30). Then you can run it like this: ./LM-Studio-0.3.30-2-arm64.AppImage --no-sandbox. This should work.
Also, can you post your args for the 14 tps run?

Thanks, I’ll try the AWQ quant.

Yeah, I just tested the latest (0.3.32-2) and it’s working. Also, make sure to give it exec rights:
```
chmod +x LM-Studio-0.3.32-2-arm64.AppImage
```

For llama-cli?

```
llama-cli -m ~/.node-llama-cpp/models/hf_mistralai_Magistral-Small-2509.Q4_K_M.gguf -p "Generate a long story. The only requirement is long."
```

The “--no-sandbox” option brings up LM Studio, but it fails to load any model, no matter what the settings are.

```
🥲 Failed to load the model

Error loading model.

(Exit code: 127). Please check settings and try loading the model again.
```

For llama-cli?

Yeah, thank you.


What model are you trying to load, and with what settings?

Also, in Mission Control → Runtime, make sure that llama.cpp is the latest version (for me right now it’s v1.58.0).

Your original post said “I’m running mistralai/magistral-small-2509 in LM Studio (but I tried with llama.cpp directly too) and basically best I can do is 10 tok/sec.”

I’m saying LM Studio never really worked for me, and that’s the error I got from the log when trying to load mistralai/magistral-small-2509 (or any other model) with the settings you posted, or any others. But it’s OK if I can’t get LM Studio to work, because llama.cpp works fine, gives a better TPS rate, and is what I prefer anyway. I’ll mostly be using the DGX Spark headless, so the terminal and a WebUI are fine; GUI apps like LM Studio are not. I was just curious how you got LM Studio working. And yes, my llama.cpp build is the latest; I built it from source two days ago.


Try this one on vLLM:
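For the AWQ quant mentioned earlier, an invocation would look roughly like the sketch below; the context length and flags here are assumptions, so check `vllm serve --help` for your build.

```shell
# Sketch of serving the AWQ 4-bit quant with vLLM; flags are assumptions.
vllm serve cpatonn/Magistral-Small-2509-AWQ-4bit \
  --max-model-len 40960
```

This starts an OpenAI-compatible server (on port 8000 by default) that you can query the same way as llama-server or LM Studio in server mode.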

Basically this exit code means that LM Studio doesn’t like the settings (usually a very large context length that overflows the VRAM will cause it), and since LM Studio downloads its own build of llama.cpp, double-checking that your settings pass its checks helps too. Just FYI.
I’m also using it headless (I run the models in server mode and connect to them via the API), but yeah, a self-built llama.cpp gives better speed… which is weird, because LM Studio uses it as the back-end to run models…
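Connecting to a headless instance works through the OpenAI-compatible HTTP API that all of these servers expose. A minimal Python sketch, assuming LM Studio’s default port 1234 and the model name used above (both are assumptions; adjust to your setup):

```python
import json
import urllib.request

# Endpoint and model name are assumptions -- LM Studio's server mode
# defaults to an OpenAI-compatible API on http://localhost:1234/v1.
payload = {
    "model": "mistralai/magistral-small-2509",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same snippet works against llama-server or vLLM by swapping the port, since they all speak the same chat-completions protocol.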