This is hopefully a simple tutorial on compiling llama.cpp on the DGX Spark. Once compiled, it can be used to run GGUF-format LLM models directly on the command line, serve them through an OpenAI-compatible API, or make them accessible via a web browser (which is what we’ll be doing in this tutorial).
As of 25 November 2025, all build tools and dependencies needed to compile llama.cpp are already installed on the DGX Spark, so you can literally just clone the repository and build it.
Qwen3 235B is slightly too big at 4-bit quantisation to fit in the Spark’s memory, so we’ll be using a roughly 3.7-bits-per-weight version instead, which still works extremely well based on my testing.
To start off, clone the llama.cpp repository:
$ git clone https://github.com/ggml-org/llama.cpp.git
$ cd llama.cpp
Now build it (using all 20 of those wonderful cores you have access to):
$ cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
$ cmake --build build --config Release -j 20
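As an aside, if you’re running this on a machine with a different core count, nproc will tell you how many cores are available, so you don’t have to hard-code the 20:

```shell
# nproc reports the number of available CPU cores, so the build step
# can be written portably as:
#   cmake --build build --config Release -j "$(nproc)"
nproc
```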
Once built, you can find the binary files in the ‘build/bin’ directory. You can copy these binaries wherever you want, or change directly into that directory, which is what I’ll do for this tutorial:
$ cd build/bin
Download the three parts of ‘Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf’. The download is around 103GB, but you can load any other GGUF model if you’d prefer running something smaller:
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part1of3
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part2of3
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part3of3
Merge the downloaded files into a single GGUF file:
$ cat Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part* > Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf
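In case you’re wondering whether the glob in the cat command preserves the part order: the shell expands part* lexicographically, so part1of3 through part3of3 are concatenated in sequence. Here’s a toy demonstration with dummy files in a temporary directory (nothing here touches your actual download):

```shell
# cat concatenates the parts in glob (lexicographic) order
cd "$(mktemp -d)"
printf 'AAA' > demo.gguf.part1of3
printf 'BBB' > demo.gguf.part2of3
printf 'CCC' > demo.gguf.part3of3
cat demo.gguf.part* > demo.gguf
MERGED="$(cat demo.gguf)"
echo "$MERGED"    # AAABBBCCC
```

Once you’ve confirmed the merged file loads, you can delete the three part files to reclaim the roughly 103GB they occupy.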
I’m not sure if it’s needed, but it’s probably a good idea to clear the caches on the DGX Spark before you attempt to load this model, as it is quite large:
$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
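Since the model occupies a large chunk of the Spark’s memory once loaded, it’s also worth checking how much memory is actually available before starting the server; the buff/cache column reported by free should shrink noticeably after dropping the caches:

```shell
# Show memory usage; compare the 'free' and 'buff/cache' columns
# before and after the drop_caches command above
free -h
```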
Now you should be able to start the server and load the model:
$ ./llama-server --no-mmap --jinja --host 0.0.0.0 --port 5000 --ctx-size 32768 --model Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf
Once loaded, you should be able to access the web server on port 5000 (which is changeable in the parameters above). For example, my DGX Spark is simply named dgx, so I can access the web app at: http://dgx.local:5000
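The same port also exposes the OpenAI-compatible API mentioned at the start, so any client that speaks that protocol can point at the Spark. Below is a minimal sketch using curl against the /v1/chat/completions endpoint, assuming the dgx.local hostname from above; the JSON body is sanity-checked locally first, so you can verify it before the server is even up:

```shell
# Request body for the OpenAI-compatible chat endpoint
BODY='{"messages":[{"role":"user","content":"Explain GGUF in one sentence."}],"max_tokens":128}'

# Sanity-check that the body is valid JSON
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send it to llama-server (uncomment once the server is running):
# curl http://dgx.local:5000/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$BODY"
```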
For reference, I’m getting around 15 tokens/second on this model, using around 107GB of the Spark’s memory.
There are a bunch of different binaries provided with llama.cpp. For example, if you want to chat directly in the terminal, you can use llama-cli as below:
$ ./llama-cli --no-mmap --ctx-size 32768 --model Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf