I’m a student who paid for a DGX Spark out of pocket, and I was excited to use it to try the new Nemotron 30B A3B, a 1M ctx model. I’ve never done long context before due to lack of hardware. Anyway, I thought it would be a lovely time to feed it my entire personal journal (just about 1M tokens) to…

Your benchmark does not match these: DGX Spark, Nemotron3, and NVFP4: Getting to 65+ tps (DGX Spark / GB10). Getting an NVFP4 quant of the new nemotron3-nano to work on the DGX Spark was challenging. However, copy and paste this and it should “just work”…

My question is of a more general nature; an 8% speed boost won’t help me. I’m talking about prompt processing (how long it takes the model to process the user’s input before it starts generating output), not output generation speed. Prompt processing speed gets worse as context grows. So…

Some Blackwell optimizations coming for llama.cpp: Llama.cpp experimental native mxfp4 support for Blackwell PR (DGX Spark / GB10). Looks like we’re about to get some extra PP performance in llama.cpp via this PR https://github.com/ggml-org/llama.cpp/pull/…

Prompt processing speed only matters while the information is first being loaded into the context. After that, questions and answers will be fast, because the entire prompt is not re-processed with each message. But even the best LLMs don’t have perfect attention over enormous contexts. I’m also not seeing th…
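To make the point about not re-processing the prompt concrete, here is a minimal sketch that counts how many prompt tokens have to be processed over a conversation, with and without reusing the cached context. The function name, the 1M-token context, and the 200-token turns are hypothetical values chosen for illustration; this is a counting model, not how any particular inference server is implemented.

```python
def prompt_tokens_processed(context_tokens, new_tokens_per_turn, reuse_cache):
    """Count prompt tokens processed across all turns of a conversation.

    context_tokens: size of the initial document loaded into the context.
    new_tokens_per_turn: tokens appended on each turn (hypothetical sizes).
    reuse_cache: if True, tokens already processed are never processed again.
    """
    total = 0
    cached = 0                 # tokens whose state is already cached
    history = context_tokens   # tokens currently in the conversation
    for new in new_tokens_per_turn:
        history += new
        total += history - cached   # only the uncached part is processed
        if reuse_cache:
            cached = history
    return total


# Hypothetical conversation: 1M-token journal, three 200-token turns.
with_cache = prompt_tokens_processed(1_000_000, [200, 200, 200], True)
without_cache = prompt_tokens_processed(1_000_000, [200, 200, 200], False)
print(with_cache, without_cache)
```

Under these assumptions the first turn pays for the whole context either way; the difference only shows up from the second turn on.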

Thanks for the feedback! You helped me a lot. I experimented with your command’s flags. Still sticking to my own build, not yours. Your big speed boost is due to using ‘-fa 1’. Testing with the same build as in the OP: -p 32000 -fa 0: 1234 PP t/s (same as before); -p 32000 -fa 1: 2158 PP t/s (+74% compared to with…
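To put the two measurements above in perspective, here is a small sketch that computes the relative speedup and a best-case time to ingest a 1M-token prompt if throughput stayed flat at the 32k rate. The flat-rate assumption is only for illustration and is optimistic, since prompt processing slows down as context grows; the function names are hypothetical.

```python
def speedup_percent(baseline_tps, new_tps):
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def best_case_prefill_seconds(n_tokens, tokens_per_second):
    """Time to process n_tokens at a constant rate (a lower bound in practice)."""
    return n_tokens / tokens_per_second


# Prompt processing rates reported in the post above, at -p 32000.
PP_FA_OFF = 1234.0  # tokens/s with -fa 0
PP_FA_ON = 2158.0   # tokens/s with -fa 1

print(f"speedup: {speedup_percent(PP_FA_OFF, PP_FA_ON):.1f}%")
for label, tps in (("-fa 0", PP_FA_OFF), ("-fa 1", PP_FA_ON)):
    minutes = best_case_prefill_seconds(1_000_000, tps) / 60.0
    print(f"{label}: at least {minutes:.1f} min for 1M tokens")
```

Real numbers above 500k context will be worse than this bound; measuring at several prompt lengths is the only way to see how quickly the rate falls off on a given build.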

Flash Attention avoids materializing the full attention score matrix, which cuts memory traffic and the size of the intermediate buffers (that’s why you are seeing a performance boost!), so there is pretty much no reason to turn it off.
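As a rough illustration of why a fused attention kernel can be faster, the sketch below compares the size of the full attention score matrix that a naive implementation would materialize for one head with the size of a single tile that a flash-style kernel works on at a time. The fp16 element size and the 128x128 tile are assumptions chosen for illustration, not values taken from llama.cpp, and real implementations also process the prompt in batches.

```python
def score_matrix_bytes(n_query, n_kv, bytes_per_elem=2):
    """Bytes needed to hold a full n_query x n_kv score matrix for one head."""
    return n_query * n_kv * bytes_per_elem


def tiled_score_bytes(tile_q, tile_kv, bytes_per_elem=2):
    """Bytes for one tile of scores, as held by a tiled (flash-style) kernel."""
    return tile_q * tile_kv * bytes_per_elem


# Assumed sizes: 32000 queries against 32000 keys, fp16 scores, 128x128 tiles.
full = score_matrix_bytes(32_000, 32_000)
tile = tiled_score_bytes(128, 128)
print(f"full score matrix, one head: {full / 1e9:.2f} GB")
print(f"one tile: {tile / 1024:.0f} KiB")
```

The full matrix grows with the product of query and key counts, while the tile stays a fixed size, which is where the saving in memory traffic comes from.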

What prompt processing speed can one expect above 500k ctx?


raphael.amorim December 24, 2025, 3:28pm 4

Some blackwell optimizations coming for llama.cpp
