Single-node and dual-node llama.cpp build flags

I see many people use vLLM as an inference engine, while not many use llama.cpp. I wonder whether people have tried building directly on the Spark? If so, what build flags have you been using? GB10 with sm_121 is an interesting case.

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 \
  -DGGML_NATIVE=ON

cmake -S . -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121a-real &&
  time cmake --build build -j"$(nproc)"

Plenty of people here use llama-server, but vLLM is the fastest inference engine in many cases, so it obviously receives a lot of attention. I find vLLM a pain to work with because it takes several minutes to start up, and I like to switch models frequently.

Thank you!

I will also check out vLLM deployment.

Oh, for -DCMAKE_CUDA_ARCHITECTURES, is 121 not enough? Must I specify 121a-real?

I think the difference is mostly compile time: 121 seems to translate to 121a in the llama.cpp repo, and 121a includes both virtual and real targets, so it will generate some extra portability (PTX) code that isn't needed. I don't think it affects runtime performance, but I have not tested lately.
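If you want to check what actually got embedded, one way (a sketch, assuming a CUDA toolkit on PATH and a binary at build/bin/llama-bench, both of which may differ on your setup) is to list the fat-binary targets with cuobjdump:

```shell
# List the targets embedded in the built fat binary.
# BIN path and cuobjdump availability are assumptions; adjust for your build.
BIN=build/bin/llama-bench
if command -v cuobjdump >/dev/null 2>&1 && [ -f "$BIN" ]; then
  cuobjdump --list-elf "$BIN"   # real targets (sm_121a cubins)
  cuobjdump --list-ptx "$BIN"   # virtual targets (PTX); expect none with 121a-real
else
  echo "cuobjdump or $BIN not available, skipping"
fi
```

With plain 121 you should see a PTX entry alongside the cubin; with 121a-real only the cubin, which is where the compile-time and binary-size savings come from.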

Used your llama.cpp build guide; it finished fairly fast. Did a few rounds of llama-bench on qwen3.5 122b. Surprised to find it is quite usable at Q4_K_M and even Q5_K_S.

| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | --- | --: | --- | ---: |
| Unsloth-qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | CUDA | 99 | 16 | q8_0 | q8_0 | 1 | pp512 | 466.52 ± 18.61 |
| Unsloth-qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | CUDA | 99 | 16 | q8_0 | q8_0 | 1 | tg128 | 17.36 ± 0.18 |
| bartowski-qwen35moe 122B.A10B Q4_K - Medium | 69.83 GiB | 122.11 B | CUDA | 999 | 16 | q8_0 | q8_0 | 1 | pp512 | 507.40 ± 14.78 |
| bartowski-qwen35moe 122B.A10B Q4_K - Medium | 69.83 GiB | 122.11 B | CUDA | 999 | 16 | q8_0 | q8_0 | 1 | tg128 | 19.03 ± 0.20 |
| bartowski-qwen35moe 122B.A10B Q5_K - Small | 78.97 GiB | 122.11 B | CUDA | 999 | 16 | q8_0 | q8_0 | 1 | pp512 | 467.42 ± 19.62 |
| bartowski-qwen35moe 122B.A10B Q5_K - Small | 78.97 GiB | 122.11 B | CUDA | 999 | 16 | q8_0 | q8_0 | 1 | tg128 | 18.50 ± 0.12 |
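For anyone wanting to reproduce rows like these, they correspond to an invocation along these lines (the GGUF path is a placeholder, and pp512/tg128 are llama-bench's default tests):

```shell
# Sketch of the llama-bench run behind the table above.
# MODEL is a placeholder path; -ngl 99 offloads all layers to the GPU,
# -ctk/-ctv q8_0 quantize the KV cache, -fa 1 enables flash attention.
BIN=./build/bin/llama-bench
MODEL=model.Q4_K_M.gguf
if [ -x "$BIN" ]; then
  "$BIN" -m "$MODEL" -ngl 99 -t 16 -ctk q8_0 -ctv q8_0 -fa 1
else
  echo "$BIN not found, skipping"
fi
```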