I see many people use vLLM as an inference engine, while not many use llama.cpp. I wonder whether anyone has tried building it directly on the Spark? If so, what build flags have people been using? GB10 with sm_121 is an interesting case.
Plenty of people here use llama-server, but vLLM is the fastest inference engine in many cases, so it obviously receives a lot of attention. I find vLLM a pain to work with because it takes several minutes to start up, and I like to change models frequently.
I think the difference is mostly compile time, since 121 seems to translate to 121a in the llama.cpp repo, and 121a includes both virtual and real targets, so it will generate some extra portability code that isn't needed. I don't think it affects runtime performance, but I haven't tested recently.
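For reference, a minimal build sketch along these lines. `GGML_CUDA` and `CMAKE_CUDA_ARCHITECTURES` are the standard llama.cpp CMake options, but the exact architecture value for GB10 (plain `121` vs. a `-real`-only target to skip the virtual/PTX portability code) is an assumption to verify against your toolkit:

```shell
# Clone and configure llama.cpp with CUDA enabled.
# "121-real" asks CMake for the real (SASS) target only, skipping the
# virtual (PTX) target that the default would also embed -- verify that
# your CUDA toolkit accepts 121 for GB10 before relying on this.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121-real"
cmake --build build --config Release -j
```

Dropping the virtual target mainly saves compile time and binary size; it shouldn't change what the GPU actually executes.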
Used your llama.cpp build guide; it finished fairly fast. Did a few rounds of llama-bench on qwen3.5 122b. Surprised to find it's quite usable at Q4_K_M and even Q5_K_S.
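In case anyone wants to reproduce this, a sketch of the kind of llama-bench invocation involved. The model filename is a placeholder, and the prompt/generation sizes shown are just llama-bench's common defaults, not the exact settings used above:

```shell
# Benchmark prompt processing (-p) and token generation (-n) for a
# quantized GGUF model, offloading all layers to the GPU (-ngl 99).
# The .gguf path is a placeholder -- substitute your own quant.
./build/bin/llama-bench -m models/qwen-q4_k_m.gguf -p 512 -n 128 -ngl 99
```

llama-bench prints a table of tokens/sec per test, which makes it easy to compare quants like Q4_K_M and Q5_K_S side by side.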