Any tips on running Magistral?

I’m running mistralai/magistral-small-2509 in LM Studio (I tried with llama.cpp directly too), and the best I can do is about 10 tok/sec.

Here are my settings:
```
quantization: Q5_K_M
context length: 40k (max for this model)
GPU offload 40 (max)

CPU Thread pool: 20 (max) - I tested with 15, it’s the same

Flash attention: On with K_Cache and V_Cache matching quants: Q5_0 (for both)
```
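For reference, these settings translate roughly to a llama-server invocation like the one below. This is a sketch: the model path is a placeholder, and flag spellings vary between llama.cpp builds, so verify against `llama-server --help` (e.g. older builds take a bare `-fa` instead of `-fa on`).

```shell
# Rough llama-server equivalent of the LM Studio settings above.
# Model path and exact flag syntax are assumptions -- check --help.
llama-server \
  -m ./Magistral-Small-2509.Q5_K_M.gguf \
  -c 40960 -ngl 40 -t 20 \
  -fa on -ctk q5_0 -ctv q5_0
```

Note that the quantized KV cache (`-ctk`/`-ctv`) requires flash attention to be enabled, which matches the LM Studio settings above.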
I wonder if anyone has had luck running it faster than 10 tok/sec?

P.S. Also, please let me know if I’m posting in the wrong forum.

Since you already have a GGUF, you could use a more current build of llama.cpp instead.

llama.cpp has recently picked up some optimizations developed in cooperation between NVIDIA and the llama.cpp team.

If you prefer a UI for swapping models, have a look at llama-swap (it runs llama.cpp underneath).
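A minimal llama-swap config looks roughly like the sketch below; the model name, path, and flags here are placeholders, so check the llama-swap README for the exact schema.

```yaml
# config.yaml for llama-swap -- sketch only, names and paths are assumptions
models:
  "magistral-small":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Magistral-Small-2509.Q5_K_M.gguf
      -c 40960 -ngl 40
```

llama-swap then exposes one endpoint and starts/stops the matching llama-server process as you switch models.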

FTR: tested against a local build 7108 (7d77f0732) of llama.cpp and got 14 t/s. Not that much of an improvement.

Also tested against a current build of vLLM (0.11.2.dev188+g730bd3537.d20251122) with an AWQ quant (4-bit, cpatonn/Magistral-Small-2509-AWQ-4bit) and got up to 16 t/s.

My box only arrived this week, as Germany was late for the release, so I was curious: everybody else seems to prefer testing MoEs, which appear to be made for this kind of box. Mistral released some MoEs in the past; I hope they will release new ones, too.


@ululusem How did you get the LM Studio AppImage to work? I gave up on that and have been running llama.cpp or node-llama-cpp, which gives 14 tps. vLLM in Docker is about the same. After I saw your post, I tried the newest download of LM Studio, but it’s still broken on the DGX Spark.

```
llama_perf_sampler_print:    sampling time =       0.73 ms /    19 runs   (    0.04 ms per token, 26027.40 tokens per second)
llama_perf_context_print:        load time =    5256.31 ms
llama_perf_context_print: prompt eval time =     257.97 ms /    20 tokens (   12.90 ms per token,    77.53 tokens per second)
llama_perf_context_print:        eval time =   61662.61 ms /   884 runs   (   69.75 ms per token,    14.34 tokens per second)
llama_perf_context_print:       total time =   67856.63 ms /   904 tokens
llama_perf_context_print:    graphs reused =        880
llama_memory_breakdown_print: | memory breakdown [MiB] |  total    free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (GB10)       | 122570 = 82405 + (14208 = 13302 +     640 +     266) +       25956 |
llama_memory_breakdown_print: |   - Host               |                     378 =   360 +       0 +      18                |
```

Go to https://lmstudio.ai/download and make sure you select Linux → ARM64 (any recent version should work, but I’m running 0.3.30). Then you can run it like this: ./LM-Studio-0.3.30-2-arm64.AppImage --no-sandbox. This should work.
Also, can you post your args for the 14 tps run?

Thanks, I’ll try the AWQ quant.

Yeah, I just tested the latest (0.3.32-2) and it’s working. Also, make sure to give it exec rights:
```
chmod +x LM-Studio-0.3.32-2-arm64.AppImage
```

For llama-cli?

```
llama-cli -m ~/.node-llama-cpp/models/hf_mistralai_Magistral-Small-2509.Q4_K_M.gguf -p "Generate a long story. The only requirement is long."
```

The “--no-sandbox” option brings up LM Studio, but it fails to load any model, no matter what the settings are.

```
🥲 Failed to load the model

Error loading model.

(Exit code: 127). Please check settings and try loading the model again.
```

For llama-cli?

Yeah, thank you.


What model are you trying to load, and with what settings?

Also, in Mission Control → Runtime, make sure that llama.cpp is the latest version (for me right now it’s v1.58.0).

Your original post said “I’m running mistralai/magistral-small-2509 in LM Studio (but I tried with llama.cpp directly too) and basically best I can do is 10 tok/sec.”

I’m saying LM Studio never really worked for me, and that’s the error I got from the log when trying to load mistralai/magistral-small-2509 (or any other model) with the settings you posted, or any others. But it’s OK if I can’t get LM Studio to work, because llama.cpp works fine, gives a better TPS rate, and is what I prefer anyway. I’ll mostly be using the DGX Spark headless, so the terminal and a WebUI are fine; GUI apps like LM Studio are not. I was just curious how you got LM Studio working. And yes, my llama.cpp build is the latest; I built it from source two days ago.


Try this one on vLLM:
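For the AWQ quant mentioned earlier, an invocation would look roughly like the sketch below; the context length and flags here are assumptions, so check `vllm serve --help` for your build.

```shell
# Sketch of serving the AWQ 4-bit quant with vLLM; flags are assumptions.
vllm serve cpatonn/Magistral-Small-2509-AWQ-4bit \
  --max-model-len 40960
```

This starts an OpenAI-compatible server (on port 8000 by default) that you can query the same way as llama-server or LM Studio in server mode.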

Basically this exit code means that LM Studio doesn’t like the settings (usually a very large context length that overflows the VRAM will cause it), and since LM Studio downloads its own build of llama.cpp, double-checking that your settings pass its checks helps too. Just FYI.
I’m also using it headless (I run the models in server mode and connect to them via the API), but yeah, a self-built llama.cpp gives better speed… which is weird, because LM Studio uses it as the back-end to run models…
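Connecting to a headless instance works through the OpenAI-compatible HTTP API that all of these servers expose. A minimal Python sketch, assuming LM Studio’s default port 1234 and the model name used above (both are assumptions; adjust to your setup):

```python
import json
import urllib.request

# Endpoint and model name are assumptions -- LM Studio's server mode
# defaults to an OpenAI-compatible API on http://localhost:1234/v1.
payload = {
    "model": "mistralai/magistral-small-2509",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same snippet works against llama-server or vLLM by swapping the port, since they all speak the same chat-completions protocol.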