Tutorial: Build llama.cpp from source and run Qwen3 235B

This is hopefully a simple tutorial on compiling llama.cpp on the DGX Spark. Once compiled, it can be used to run GGUF LLM models directly on the command line, serve them via an OpenAI-compatible API, or access them through a web browser (which is what we’ll be doing for this tutorial).

As of 25 November 2025, all build tools and dependencies needed to compile llama.cpp are already installed on the DGX Spark, so you can literally just clone the repository and build it.
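If you want to double-check that before you start, a quick look at the toolchain versions should confirm everything is in place (the exact versions will depend on your DGX OS release):

$ cmake --version
$ gcc --version
$ nvcc --version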

Qwen3 235B at 4-bit quantisation is slightly too big for the Spark’s memory, so we’ll be using a roughly 3.7-bit-per-weight version, which still works extremely well based on my testing.

To start off, clone the llama.cpp repository:

$ git clone https://github.com/ggml-org/llama.cpp.git
$ cd llama.cpp

Now build it (using all 20 of those wonderful cores you have access to):

$ cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF
$ cmake --build build --config Release -j 20

Once built, you can find the binary files in the ‘build/bin’ directory. You can copy these binaries wherever you want, or change directly into that directory, which is what I’ll do for this tutorial:

$ cd build/bin
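If you want to confirm the build worked, the binaries should all be sitting in this directory, and asking one of them for its version should print the build info:

$ ls llama-*
$ ./llama-server --version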

Download the three parts of ‘Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf’. The download is around 103GB in total, but you can load any other GGUF model if you’d prefer running something smaller:

$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part1of3
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part2of3
$ wget https://huggingface.co/mradermacher/Qwen3-235B-A22B-Thinking-2507-i1-GGUF/resolve/main/Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part3of3

Merge the downloaded files into a single GGUF file:

$ cat Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part* > Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf
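As a quick sanity check, the merged file should come out at roughly the same ~103GB as the three parts combined, and once you’re happy with it you can delete the parts to reclaim the disk space:

$ ls -lh Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf
$ rm Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf.part*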

I’m not sure if it’s needed, but it’s probably a good idea to clear the caches on the DGX Spark before you attempt to load this model as it is quite large:

$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

Now you should be able to start the server and load the model:

$ ./llama-server --no-mmap --jinja --host 0.0.0.0 --port 5000 --ctx-size 32768 --model Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf

Once loaded, you should be able to access the web server on port 5000 (which is changeable in the parameters above). For example, my DGX Spark is simply named dgx, so I can access the web app at: http://dgx.local:5000
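The same server also exposes an OpenAI-compatible API, so if you’d rather script against it than use the browser, a quick test with curl looks something like this (swap dgx.local for your own hostname or IP):

$ curl http://dgx.local:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}]}'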

For reference, I’m getting around 15 tokens/second on this model, using around 107GB of the Spark’s memory.

There are a bunch of different binaries provided with llama.cpp. For example, if you want to chat directly in the terminal, you can use llama-cli as below:

$ ./llama-cli --no-mmap --ctx-size 32768 --model Qwen3-235B-A22B-Thinking-2507.i1-IQ3_M.gguf

4 Likes

A few notes on this:

  1. I’d recommend Unsloth’s Q3_K_XL quant to run on a single Spark (or Q4_K_XL on dual). It is bigger in size, but still fits on a single Spark and uses Unsloth’s Dynamic Quants, where some layers are quantized at higher bits, which improves overall quality.
  2. You don’t have to pre-download the weights, just use the -hf switch (it also correctly initializes the MM projector if it’s a vision-enabled model).
  3. You don’t need to merge the splits into a single file - llama.cpp is smart enough to pick up all of them, just point it at the first one.
  4. Don’t overlook --no-mmap, otherwise model loading will take much longer.
  5. -ngl 999 and -fa 1 are important too, as they ensure full model offloading to the GPU and enable Flash Attention. They may or may not be defaults now, but it’s always good to set them.

Example of -hf switch:

build/bin/llama-server -hf unsloth/Qwen3-VL-235B-A22B-Instruct-GGUF:Q3_K_XL -ngl 999 --jinja -c 64000 --host 0.0.0.0 -fa 1 --no-mmap

Example of pointing at just the first split file (for example, when running llama-bench):

build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-VL-235B-A22B-Instruct-GGUF_UD-Q3_K_XL_Qwen3-VL-235B-A22B-Instruct-UD-Q3_K_XL-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0

EDIT: also, no need to drop caches for llama.cpp - it can handle unified memory just fine.

6 Likes

Thanks for your inputs. And yes, -ngl 999 and -fa 1 are defaults now.

Please forgive my ignorance, but I always run into CUDA version issues — with the implication that the Spark’s standard CUDA 13 is too new. I’ve updated to what should be standard config as of 11/26/2025. (I can include logs as necessary, but didn’t want to flood anyone if I’m just doing something obviously stupid.)

Problem solved: the “obviously stupid” thing I did was to have an old nvcc installed via “apt install”. Even though it wasn’t at the front of my PATH, CMake picked it up anyway and hilarity ensued. This is probably a holdover from bad advice someone gave me on day one that I never went back and cleaned up. After uninstalling it, everything is good.

2 Likes

Glad you got it sorted, but let me give you some advice: you shouldn’t need to mess with your CUDA installation, nvcc, or any other core NVIDIA libraries that are already installed on the system. CUDA is backwards compatible, so even with a CUDA 13 installation, it still supports CUDA 12.8 and lower. Touching it shouldn’t be necessary, and you’re probably just going to create pain for yourself if you try.

2 Likes

Just for everyone’s benefit, if you do want to use the ability to download directly from huggingface, I’m pretty sure you have to compile llama.cpp with curl support. To do that you need to install libcurl:

sudo apt-get install libcurl4-openssl-dev

And then, when building, obviously turn the curl flag on:

$ cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
$ cmake --build build --config Release -j 20

Then it should work and pull things from huggingface. I don’t personally use or plan to use this feature, but I get that it’s convenient for some people. I’m weird in that I like manually managing my files and not having things in weird places like hidden folders and stuff.
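For what it’s worth, the -hf downloads don’t go anywhere too mysterious: judging by the llama-bench example above they end up under ~/.cache/llama.cpp, so you can always inspect (or clean out) that directory yourself:

$ ls -lh ~/.cache/llama.cpp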

I am encountering the same issue: CMake invokes /usr/bin/nvcc even though /usr/local/cuda/bin/nvcc is present in the PATH. Below is the first part of the error message:

CMake Error at /usr/share/cmake-3.28/Modules/CMakeDetermineCompilerId.cmake:780 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.
  Compiler: /usr/bin/nvcc
....

Indeed, two different nvcc binaries exist on the system, and the one that appears in the PATH is not used by CMake.

sparku@spark:~/llama.cpp$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Aug_20_01:57:39_PM_PDT_2025
Cuda compilation tools, release 13.0, V13.0.88
Build cuda_13.0.r13.0/compiler.36424714_0

The location of this nvcc is:

sparku@spark:~/llama.cpp$ which nvcc
/usr/local/cuda/bin/nvcc

CMake, however, picks up the older version located in /usr/bin, which fails to compile:

sparku@spark:~/llama.cpp$ /usr/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:42_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

What would be the best solution?

I don’t think you’re supposed to have a /usr/bin/nvcc; you must have installed it somehow. The one it’s supposed to use is /usr/local/cuda/bin/nvcc.
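If you want to track down and remove the stray copy, something like this should do it (assuming it came in via apt, as in the earlier post; check what dpkg actually reports before removing anything):

$ dpkg -S /usr/bin/nvcc
$ sudo apt remove nvidia-cuda-toolkit   # or whichever package dpkg reported owning it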

I also applied the latest DGX Spark update to double-check, and it hasn’t added that second copy of nvcc. I can also confirm that llama.cpp still builds correctly with the update applied.

2 Likes

I used

$ cmake -B build -DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc -DLLAMA_CURL=ON -DGGML_CUDA=ON

and everything was OK, thanks for your joint effort. Is there a reason NVIDIA didn’t set this path by default on the Spark?

1 Like

You shouldn’t have to set that parameter at all, as it is already the default on the Spark. If you needed to set the location of nvcc manually, then you’ve obviously broken something, or you’re using a software image from one of the other vendors and they’ve broken something.

1 Like

Thank you, nice fix!

1 Like

Yes, that’s what I thought, at least that the CUDA environment would be cleanly integrated. But it isn’t in the recovery image (dgx-spark-recovery-image-1.105.17.tar.gz) or any of the updates, where it seems to have been forgotten. One gets the feeling that the NVIDIA DGX department is a little too vain for the Spark?
So, for our weight-watchers Spark self-help group (who have the same problem with the recovery image), you could add:

1 Like

As a fun addition to the topic, the largest I’ve run on a single Spark is Llama 3.1 405B. Being a dense model, it is pretty slow though at around 2 tokens per second, but it works, and essentially shows what is possible should an MoE model of that size show up down the road.

More specifically, for those interested, I ran the IQ2_XXS quant with a context size of 16K from:

To be clear, Qwen3 235B is a far more practical model to run on the Spark: it’s fairly close to a 4-bit quant, especially as it’s not just blindly quantised but a dynamic quant that performs quite well, and it’s also quite fast at 14-15 tokens per second. So the 405B is more a fun addition than a model I would recommend.
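For anyone who wants to repeat the 405B experiment, the command is the same as for Qwen3, just with a smaller context and pointed at whichever IQ2_XXS GGUF you downloaded (the filename below is only a placeholder):

$ ./llama-server --no-mmap --jinja --host 0.0.0.0 --port 5000 --ctx-size 16384 --model Llama-3.1-405B-Instruct.i1-IQ2_XXS.gguf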

2 Likes

Yeah, both Qwen3-VL 235B and MiniMax M2 are quite usable even on a single Spark in Q3_K_XL quant from Unsloth.

1 Like

That MiniMax M2 sounds awesome. I’ll try that as a code copilot.

Yeah, I use it in VSCode Cline and it works really well.

I’m definitely going to check that out, as I’ve been trying to find an alternative to GitHub Copilot ever since I got the Spark. Everyone seems to recommend Continue.dev, even the NVIDIA tutorial, but I just find it doesn’t work very well. For example, when it lists files, it just lists my / directory rather than the project’s root, and from there it obviously can’t find any project files; it’s just awful. I’ve been considering just resubscribing to GitHub Copilot, but if I could find a workable alternative that’d be awesome.

It was good back in the day, but since the rise of agentic AI, Cline and its clones (Roo Code, Kilo Code and others) have taken over. Or you can use Claude Code / Codex or Open Code. Aider works well too, but it’s more “old school” now. For VSCode I switch between Cline and Roo Code. You can still use continue.dev for autocomplete though.

I was messing around with Continue until 4am last night. I assumed that, being NVIDIA’s choice for their coding assistant guide, it would be the best, but it’s just awful; I never got it to work properly. Tool execution would constantly fail no matter what model I was using, and it didn’t seem to have any sense of where my project is located (even though VS Code has my project directory open), so it would keep searching from root with no idea what’s going on. It was a really terrible experience. I’m just trying Cline now, and so far it seems good (in the sense that it actually works).

Well, it was just a playbook example. They also chose Ollama for one of their playbooks.