Hi everyone,
I’ve been on Mac (M-series) forever, using LM Studio / Metal. Everything was transparent: download, run, done.
Recently I decided to move to the “big leagues” for serious workloads and bought an NVIDIA DGX Spark. I expected to launch into orbit — instead I’ve spent the last three days breaking my brain trying to get to a clear, fast, stable setup.
This “brave new world” of the Linux ML stack is honestly overwhelming.
Instead of a single Start button, I got a LEGO set of knobs and “best practices” that change depending on the model, engine, quantization, and seemingly the phase of the moon:
- vLLM vs llama.cpp vs TGI — what’s actually fastest and most stable on a single powerful node?
- KV-cache tweaks, PagedAttention, etc. — what’s worth turning on vs ignoring?
- NUMA optimization, mlock, no-mmap — meaningful gains or micro-optimizations?
- CUDA Graphs — enable for speed, or just an invitation to weird errors?
- Flash Attention 2 — must-have or “it depends”?
Even Perplexity can’t give a straight answer on what the current industry standard is for maximum inference performance on one (but strong) machine.
Questions for people who’ve been through this
1) Engine: is vLLM worth the pain?
On one hand, vLLM is fast. On the other, it seems extremely aggressive with memory reservation.
- I tried Nemotron 30B and it somehow ate ~110 GB of RAM (feels like KV-cache allocation behavior), and then running anything alongside it becomes unrealistic.
- llama.cpp looks simpler and more “I can reason about it,” but how much performance am I giving up (especially for batching / throughput) compared to vLLM’s optimized CUDA kernels and attention optimizations?
2) Quantization: what should I actually use right now?
My current understanding:
- NVFP4 is basically a unicorn right now — mostly living inside unstable Docker setups (e.g., Avarok) and waiting for broader/cleaner support on Blackwell (GB10/200).
- So for Hopper/Ada today, what’s the practical choice: AWQ 4-bit? GPTQ? MXFP4?
- Where’s the real balance between “won’t run reliably” and “flies but quality drops too much”?
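Whichever format wins, the first-order effect is just bits per weight; the formats mostly differ in kernel support and how much quality survives. A rough floor on weight memory for a 30B model (real 4-bit formats add per-group scale/zero-point overhead, very roughly an extra half bit per weight at group size 128, so I've used 4.5 bits as an illustrative guess):

```python
# Rough weight-memory floors for a 30B model at common bit-widths.
# Activations and KV cache come on top of this.
params = 30e9

def weight_gb(bits_per_weight: float) -> float:
    """Weight memory in GB at a given effective bits-per-weight."""
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("fp16/bf16", 16),
                   ("int8", 8),
                   ("4-bit + group overhead", 4.5)]:
    print(f"{name:<24} ~{weight_gb(bits):5.1f} GB")
```

That 60 GB → ~17 GB drop is why 4-bit is usually where single-node setups land; the open question is which format's kernels are fastest and least lossy on your hardware.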
3) My workload
I’m building a local LightRAG / Knowledge Graph pipeline. I need maximum throughput for indexing and processing a large corpus.
Using Ollama for this feels like a downgrade and kind of defeats the point of buying powerful NVIDIA hardware — at that point I could’ve stayed on my Mac where everything “just worked.”
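For bulk indexing, my understanding is that throughput mostly comes from keeping the engine's continuous batching saturated rather than from client-side tricks: fire many requests concurrently against the server and let it pack them into GPU batches. A minimal sketch; `extract_entities` is a hypothetical stub standing in for a real HTTP call to an OpenAI-compatible endpoint:

```python
import asyncio

async def extract_entities(doc: str) -> str:
    """Hypothetical stand-in for a request to a local inference server
    (e.g. vLLM's OpenAI-compatible endpoint); swap in a real HTTP client."""
    await asyncio.sleep(0)  # placeholder for network/inference latency
    return f"entities({doc})"

async def index_corpus(docs, max_in_flight: int = 64):
    # Cap in-flight requests client-side; the server's continuous
    # batching packs whatever is pending into GPU batches.
    sem = asyncio.Semaphore(max_in_flight)

    async def one(doc: str) -> str:
        async with sem:
            return await extract_entities(doc)

    return await asyncio.gather(*(one(d) for d in docs))

results = asyncio.run(index_corpus([f"doc{i}" for i in range(5)]))
```

The `max_in_flight` value is a knob to tune against your KV-cache budget: too low starves the batcher, too high just queues requests.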
Bottom line
Can someone share the current “gold standard” setup?
What stack (Software + Model Format + Key Parameters) are you using in production or serious development to squeeze maximum performance out of NVIDIA, without turning into a full-time DevOps engineer?
Thanks in advance 🙏