Tutorial: Build llama.cpp from source and run Qwen3 235B

This is hopefully a simple tutorial on compiling llama.cpp on the DGX Spark. Once compiled, it can be used to run GGUF-format LLM models directly on the command line, serve them as an OpenAI-compatible API, or access them via a web browser (which is what we’ll be doing in this tutorial).

As of 25 November 2025, all the build tools and dependencies needed to compile llama.cpp are already installed on the DGX Spark, so you can literally just clone the repository and build it.

Qwen3 235B at 4-bit quantisation is slightly too big for the memory on the Spark, so we’ll be using a roughly 3.7-bits-per-weight version, which still works extremely well based on my testing.
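As a quick sanity check on sizes, here’s the back-of-the-envelope arithmetic (a sketch: 235B is the nominal parameter count, and ~3.7 bits per weight is an approximation for this quant):

```shell
# Rough size estimate: parameters x bits-per-weight / 8 = bytes
python3 -c "
params = 235e9   # nominal parameter count for Qwen3 235B
bpw = 3.7        # approximate bits per weight for this quant
print(f'~{params * bpw / 8 / 1e9:.0f} GB')
"
```

That lands in the same ballpark as the ~103 GB download below, which is why this quant fits where a 4-bit one doesn’t.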

To start off, clone the llama.cpp repository:

$ git clone https://github.com/ggml-org/llama.cpp.git
$ cd llama.cpp

Now build it (using all 20 of those wonderful cores you have access to):

$ cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
$ cmake --build build --config Release -j 20

Once built, you can find the binary files in the ‘build/bin’ directory. You can copy these binaries wherever you want, or change directly into that directory, which is what I’ll do for this tutorial:

$ cd build/bin

Download the three parts of ‘Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf’. The total download is around 103GB, but you can load any other GGUF model if you’d prefer running something smaller:

$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part1of3
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part2of3
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part3of3

Merge the downloaded files into a single GGUF file:

$ cat Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part* > Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf
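If you want to convince yourself that a plain cat really does reconstruct the file byte-for-byte, you can rehearse the merge on a small dummy file first (the file names here are made up for the demo):

```shell
# Create a dummy file, split it into parts, merge with cat, and compare
head -c 1048576 /dev/urandom > dummy.bin           # 1 MiB stand-in for the GGUF
split -b 409600 -d dummy.bin dummy.bin.part        # split into 400 KB chunks
cat dummy.bin.part* > dummy-merged.bin             # same merge as above
cmp dummy.bin dummy-merged.bin && echo "merge OK"  # byte-for-byte identical
rm dummy.bin dummy.bin.part* dummy-merged.bin
```

The shell expands `dummy.bin.part*` in lexicographic order, which is exactly why the numbered `part1of3`/`part2of3`/`part3of3` suffixes concatenate correctly.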

I’m not sure if it’s needed, but it’s probably a good idea to clear the caches on the DGX Spark before you attempt to load this model as it is quite large:

$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

Now you should be able to start the server and load the model:

$ ./llama-server --no-mmap --jinja --host 0.0.0.0 --port 5000 --ctx-size 32768 --model Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf

Once loaded, you should be able to access the web server on port 5000 (which is changeable in the parameters above). For example, my DGX Spark is simply named dgx, so I can access the web app at: http://dgx.local:5000
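Besides the web UI, the same server exposes an OpenAI-compatible API. As a sketch (the dgx.local host and port 5000 match my invocation above; adjust them to your setup), you can query it with curl:

```shell
# Ask the model a question via the OpenAI-compatible chat endpoint;
# falls back to a message if the server isn't up yet.
curl -s -m 60 http://dgx.local:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}' \
  || echo "server not reachable"
```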

For reference, I’m getting around 15 tokens/second on this model, using around 107GB of the Spark’s memory.
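To get a feel for what those numbers mean in practice (assuming the Spark’s 128 GB of unified memory, and treating 15 tokens/second as typical):

```shell
# Rough headroom and latency figures from the numbers above
python3 -c "
mem_total, mem_used, tps = 128, 107, 15   # GB, GB, tokens/second
print(f'{mem_total - mem_used} GB headroom')
print(f'~{1000 / tps:.0f} s for a 1000-token response')
"
```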

There are a bunch of different binaries provided with llama.cpp. For example, if you want to chat directly in the terminal, you can use llama-cli as below:

$ ./llama-cli --no-mmap --ctx-size 32768 --model Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf


A few notes on this:

  1. I’d recommend Unsloth’s Q3_K_XL quant to run on a single Spark (or Q4_K_XL on dual). It is bigger, but still fits in a single Spark and uses Unsloth’s Dynamic Quants, where some layers are quantized at higher bit-widths, which improves overall quality.
  2. You don’t have to pre-download the weights; just use the -hf switch (it also correctly initializes the MM projector if it’s a vision-enabled model).
  3. You don’t need to merge the splits into a single file - llama.cpp is smart enough to pick up all of them; just point it at the first one.
  4. Don’t overlook --no-mmap, otherwise model loading will take much longer.
  5. -ngl 999 and -fa 1 are important too, as they ensure full model offloading to the GPU and enable Flash Attention. They may or may not be defaults now, but it’s always good to set them.

Example of the -hf switch:

build/bin/llama-server -hf unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF:Q3_K_XL -ngl 999 --jinja -c 64000 --host 0.0.0.0 -fa 1 --no-mmap

Example of pointing only to one tensor split (for example, when running llama-bench):

build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-VL-235B-A22B-Instruct-GGUF_UD-Q3_K_XL_Qwen3-VL-235B-A22B-Instruct-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

EDIT: also, no need to drop caches for llama.cpp - it can handle unified memory just fine.


Thanks for your inputs. And yes, -ngl 999 and -fa 1 are defaults now.

Please forgive my ignorance, but I always run into CUDA version issues — with the implication that the Spark’s standard CUDA 13 is too new. I’ve updated to what should be standard config as of 11/26/2025. (I can include logs as necessary, but didn’t want to flood anyone if I’m just doing something obviously stupid.)

Problem solved: The “obviously stupid” thing I did was to have an old nvcc installed via “apt install”. Even though it wasn’t at the front of my path, cmake put it there and hilarity ensued. This is all probably a holdover from bad advice someone gave me on day one, and I just never paid attention and cleaned it up again. After uninstall, everything is good.


Glad you got it sorted, but let me give you some advice: you shouldn’t need to mess with your CUDA installation, nvcc, or any other core NVIDIA libraries that are already installed on the system. CUDA is backwards compatible, so even a CUDA 13 installation still supports code built for CUDA 12.8 and lower; tinkering with it will probably just create pain for yourself.


Just for everyone’s benefit: if you do want the ability to download directly from Hugging Face, I’m pretty sure you have to compile llama.cpp with curl support. To do that, you first need to install libcurl:

sudo apt-get install libcurl4-openssl-dev

And then, when building, obviously turn the curl flag on:

$ cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
$ cmake --build build --config Release -j 20

Then it should work and pull things from huggingface. I don’t personally use or plan to use this feature, but I get that it’s convenient for some people. I’m weird in that I like manually managing my files and not having things in weird places like hidden folders and stuff.