I’m a student who paid for a DGX Spark out of pocket, and I was excited to use it to try the new Nemotron 30B A3B, a 1M ctx model. I’ve never done long context before due to lack of hardware.
Anyway, I thought it would be a lovely time to feed it my entire personal journal (just about 1M tokens) to ask questions about it.
So I got the Q8_0 gguf, and a llama.cpp CUDA build, and began experimenting with the prompt processing speed at different context lengths (I also wanted to get an idea of the DGX Spark’s performance). I used the command llama-bench -m noctrex_NVIDIA-Nemotron-3-Nano-30B-A3B-MXFP4_MOE.gguf --repetitions 1 -p 128000
Given this steep and continuous falloff, what speed can I even expect at 500k, 700k, 900k? My quick napkin math tells me I’ll be under 1 tok/sec by the time I’m at 500k, and this whole idea is impossible.
My question is of a more general nature; an 8% speed boost won’t help me.
I’m talking about Prompt Processing (how long it takes for the model to process the user’s input, before it starts generating output), not output generation speed. Prompt Processing speed gets worse as context grows. So I haven’t even gotten far enough to ask it questions: I’m worried it could take days (or years) before it can finish processing my initial prompt.
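To put rough numbers on the worry, here’s the napkin math in Python (the prompt size is from the post; the speeds are just illustrative points on the falloff curve):

```python
# Time to finish prefill on a ~1M-token prompt at various average PP speeds.
prompt_tokens = 1_000_000

for pp_tps in (1000, 100, 10, 1):  # average prompt-processing tokens/sec
    seconds = prompt_tokens / pp_tps
    print(f"{pp_tps:>5} t/s -> {seconds / 3600:8.1f} h ({seconds / 86400:6.2f} days)")
```

Even a steady 1 t/s works out to about 11.6 days for the full prompt, so “days” is plausible at the speeds I was fearing; “years” would need something like 0.03 t/s sustained.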
(Btw, I’m getting almost the same generation speed on llama.cpp as in your link. They got 67 t/s, I’m getting 61.)
The prompt processing speed is only needed to load the information in the context. After that, questions and answers will be fast, because it is not re-processing the entire prompt with each message. But, even the best LLMs don’t have perfect attention over enormous contexts.
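A toy cost model of why follow-up turns are cheap once the prefill is done (function name and numbers are illustrative, not llama.cpp internals):

```python
# With a KV cache, each new message only pays for its *new* tokens;
# without one, you would re-process the entire conversation every turn.
def tokens_processed(history, new_tokens, kv_cache=True):
    return new_tokens if kv_cache else history + new_tokens

history = 1_000_000   # the journal, already prefilled into context
question = 50         # a short follow-up question

print(tokens_processed(history, question, kv_cache=True))   # 50
print(tokens_processed(history, question, kv_cache=False))  # 1000050
```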
I’m also not seeing the same performance numbers you are, maybe because I have a newer llama.cpp.
Here I am running at 256k context, which is more than you did in any of your tests:
On average over 256,000 tokens of prompt (-p), it was at 1600 tokens per second. If we measure only the prompt processing after 256,000 tokens of context, it is still going at a respectable 992 tokens per second.
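Plugging those measurements in, the one-time prefill cost is quite tolerable:

```python
ctx = 256_000
avg_tps = 1600   # average PP speed over the whole prompt
end_tps = 992    # marginal PP speed at the 256k mark

print(f"total prefill time: {ctx / avg_tps:.0f} s")   # 160 s
print(f"marginal rate at 256k: {end_tps} t/s")
```

So the whole 256k prefill takes under three minutes at these speeds, and the KV cache then makes every follow-up turn cheap.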
I experimented with your command’s flags, still sticking to my own build, not yours. Your big speed boost comes from using ‘-fa 1’. Testing with the same build as in the OP:
-p 32000 -fa 0: 1234 PP t/s (same as before)
-p 32000 -fa 1: 2158 (+75% compared to without Flash Attention)
-p 64000 -fa 0: 819 (-34% drop compared to -p 32000 -fa 0)
-p 64000 -fa 1: 2025 (+147% compared to without Flash Attention, -6% drop compared to -p 32000 -fa 1)
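For reference, the deltas are just ratios of the measured PP speeds, and they’re easy to recompute (note the 64k -fa 0 drop works out to about -34%):

```python
def pct(new, old):
    """Percent change from old to new."""
    return 100 * (new / old - 1)

print(f"{pct(2158, 1234):+.0f}%")  # -fa 1 vs -fa 0 at -p 32000
print(f"{pct(819, 1234):+.0f}%")   # -p 64000 vs -p 32000, -fa 0
print(f"{pct(2025, 819):+.0f}%")   # -fa 1 vs -fa 0 at -p 64000
print(f"{pct(2025, 2158):+.0f}%")  # -p 64000 vs -p 32000, -fa 1
```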
When I got the Spark, on the first GGUF I tried I noticed enabling Flash Attention added a mere 10% boost to PP and no other benefit, but it seems the difference is yuge depending on the model.
Tests with other models at -p 8192, comparing PP speed with -fa 0 vs -fa 1:
Flash Attention doesn’t shrink the KV cache itself; it avoids materializing the full attention-score matrix, which cuts memory traffic at long context (that’s why you’re seeing the performance boost), and in llama.cpp it’s also what enables quantized KV cache. So there is pretty much no reason to turn it off.
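To see why this matters at long context, a rough sketch of the score-matrix memory naive attention would need versus a flash-attention-style tile (per attention head, fp16 scores; sizes are illustrative and real kernels differ):

```python
n = 256_000      # context length
bytes_per = 2    # fp16 score
block = 128      # flash attention streams scores through small tiles

naive = n * n * bytes_per        # full n x n score matrix, per head
tiled = block * block * bytes_per  # only one tile of scores live at a time

print(f"naive: {naive / 2**30:.0f} GiB per head")  # ~122 GiB
print(f"tiled: {tiled / 1024:.0f} KiB per head")   # 32 KiB
```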