I see many people use vLLM as an inference engine, while not many use llama.cpp. I wonder whether anyone has tried building it directly on the Spark? If so, what build flags have people been using? GB10 with sm_121 is an interesting case.
Plenty of people here use llama-server, but vLLM is the fastest inference engine in many cases, so it obviously receives a lot of attention. I find vLLM a pain to work with because it takes several minutes to start up, and I like to change models frequently.
I think the difference is mostly compile time, since 121 seems to translate to 121a in the llama.cpp repo, and 121a includes both virtual and real targets, so it will generate some extra portability code that isn't needed. I don't think it affects runtime performance, but I haven't tested recently.
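For reference, a minimal build sketch along these lines. `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` are the standard llama.cpp CMake options, but the exact architecture value for GB10 (plain `121` vs. a `-real`-only target to skip the virtual/PTX portability code) is an assumption to verify against your toolkit:

```shell
# Clone and configure llama.cpp with CUDA enabled.
# "121-real" asks CMake for the real (SASS) target only, skipping the
# virtual (PTX) target that the default would also embed -- verify that
# your CUDA toolkit accepts 121 for GB10 before relying on this.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121-real"
cmake --build build --config Release -j
```

Dropping the virtual target mainly saves compile time and binary size; it shouldn't change what the GPU actually executes.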
Used your llama.cpp build guide; it finished fairly fast. Did a few rounds of llama-bench on qwen3.5 122b. Surprised to find it's quite usable at Q4_K_M and even Q5_K_S.
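In case anyone wants to reproduce this, a sketch of the kind of llama-bench invocation involved. The model filename is a placeholder, and the prompt/generation sizes shown are just llama-bench's common defaults, not the exact settings used above:

```shell
# Benchmark prompt processing (-p) and token generation (-n) for a
# quantized GGUF model, offloading all layers to the GPU (-ngl 99).
# The .gguf path is a placeholder -- substitute your own quant.
./build/bin/llama-bench -m models/qwen-q4_k_m.gguf -p 512 -n 128 -ngl 99
```

llama-bench prints a table of tokens/sec per test, which makes it easy to compare quants like Q4_K_M and Q5_K_S side by side.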