Building llama.cpp container images for Spark/GB10

Hi!

For those who like to run $THINGS in containers: I tried to find a way to build a Docker image for llama.cpp, as there are currently no container images at all for arm64 - only for amd64. [1]

[1] See the open issue for details: [Tracker] Docker build fails on CI for arm64 · Issue #11888 · ggml-org/llama.cpp · GitHub

So I tried to find out what the normal build process looks like, why it fails for arm64, and how to get it running on our GB10s.

The standard Dockerfiles are located in the .devops folder of the official repo. There is also one for CUDA, but it fails on GB10 out of the box. The main reason is a wrong LD_LIBRARY_PATH, which leads to this error:

#14 92.07 [ 62%] Building CXX object common/CMakeFiles/common.dir/peg-parser.cpp.o
#14 92.23 [ 62%] Building CXX object common/CMakeFiles/common.dir/regex-partial.cpp.o
#14 92.33 [ 62%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
#14 92.38 [ 62%] Linking CXX executable ../../bin/llama-simple
#14 92.44 /usr/bin/ld: warning: libcuda.so.1, needed by ../../bin/libggml-cuda.so.0.9.4, not found (try using -rpath or -rpath-link)
#14 92.45 [ 62%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o

In the devel container used by llama.cpp (nvidia/cuda:13.0.2-devel-ubuntu24.04), the LD_LIBRARY_PATH points to:

root@3e1024dcf4f9:$ env|grep LIBRARY
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64

The needed library is found in /usr/local/cuda-13/compat, so you need to adjust the ENV for that container.

So just add

ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

to the build stage. I also added the CUDA architecture for cmake: if not specified, cmake tries to build for all visible architectures (if I understood the docs correctly), but the normal Docker build process does not have access to the GPU while building. You can change this with BuildKit by defining a builder with GPU support (see Container Device Interface (CDI) | Docker Docs), which is much too complicated here, but may be useful in other projects.
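Put together, the relevant additions look roughly like this. This is only a sketch of the two changes, not the full file (see the linked spark.Dockerfile for the working version); the architecture value 121 (GB10's compute capability 12.1) is my assumption - verify it for your board:

```dockerfile
# Sketch of the two changes relative to the stock CUDA Dockerfile
# (see the linked spark.Dockerfile for the full, working version):
FROM nvidia/cuda:13.0.2-devel-ubuntu24.04 AS build

# libcuda.so.1 lives in the compat directory of the devel image
ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

# Pin the GPU architecture so cmake does not need a visible GPU at build
# time (12.1 is GB10's compute capability; assumption - verify on your box)
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 && \
    cmake --build build --config Release -j
```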

The modified Dockerfile can be found here: llama.cpp Dockerfile for DGX Spark / GB10 · GitHub

So here are all the steps to build a server container image:

mkdir src
cd src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp/.devops
wget https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile
cd ..
docker build -f .devops/spark.Dockerfile --target server -t llama.cpp:server-spark .

Hope that saves others some time when trying to build $THINGS that are built similarly.

Maybe I can push that upstream, so the llama.cpp team will integrate it into their build process.

Feedback welcome.


@cosinus thanks a lot!

Here are the instructions!

1️⃣ Build a GB10-compatible llama.cpp Docker image

Goal: have a llama.cpp:server-spark image that works correctly on your NVIDIA GB10 (arm64).

# 1. Create a working directory
mkdir -p ~/src
cd ~/src

# 2. Clone the official llama.cpp repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# 3. Go into the .devops folder
cd .devops

# 4. Download the special Dockerfile for DGX Spark / GB10
wget "https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile"

# 5. Go back to repo root
cd ..

Then build the server image with CUDA + GB10 fix:

docker build \
  -f .devops/spark.Dockerfile \
  --target server \
  -t llama.cpp:server-spark .

This image:

  • Uses nvidia/cuda:13.0.2-devel-ubuntu24.04 as base

  • Fixes LD_LIBRARY_PATH to include /usr/local/cuda-13/compat

  • Builds llama-server for arm64 + GB10.


2️⃣ Download your model (Mistral Small 3.2 24B GGUF)

We chose:

unsloth/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf

On your Spark:

mkdir -p /home/user/models
cd /home/user/models

wget "https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/resolve/main/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf" \
  -O mistral-small-3.2-24b-ud-q4_k_xl.gguf

ls -lh /home/user/models

You should see the GGUF file in that folder.
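Beyond listing the file, you can sanity-check the download itself: the GGUF format starts with the ASCII magic bytes `GGUF`, so reading the first four bytes tells you whether you got a model file or, say, an HTML error page. The helper below is just a sketch, reusing the path from above:

```shell
# Sanity-check a GGUF download: a valid file starts with the ASCII magic "GGUF".
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "OK: $1"
  else
    echo "NOT a GGUF file: $1"
  fi
}

check_gguf /home/user/models/mistral-small-3.2-24b-ud-q4_k_xl.gguf
```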


3️⃣ Run llama.cpp server container on port 3010

We then started a container using the image you built and the model you downloaded:

docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 16384 \
    --threads -1 \
    --n-gpu-layers 99 \
    --flash-attn auto

Check it’s running:

docker ps | grep llama

You should see llama-spark-mistral32 in the Up state.


4️⃣ Test the HTTP API on the Spark

Still on the Spark:

curl -s http://localhost:3010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-3.2-24b-ud-q4_k_xl.gguf",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Say a very short sentence in English." }
    ],
    "max_tokens": 64,
    "temperature": 0.4
  }'

From your Mac, you then call the same endpoint using the Spark IP:

LLAMA_URL = "http://xx.xx.xx.xx:3010/v1/chat/completions"
LLAMA_MODEL = "mistral-small-3.2-24b-ud-q4_k_xl.gguf"
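Those two constants suggest a Python script; a minimal stdlib-only client for the same endpoint could look like this. It is a sketch: the host IP stays a placeholder, and the helper names are mine, not from the original prompt-generation script:

```python
# Minimal stdlib-only client for the llama.cpp OpenAI-compatible endpoint.
# The IP placeholder and the helper names are illustrative, not from the
# original prompt-generation script.
import json
import urllib.request

LLAMA_URL = "http://xx.xx.xx.xx:3010/v1/chat/completions"
LLAMA_MODEL = "mistral-small-3.2-24b-ud-q4_k_xl.gguf"

def build_payload(user_msg: str) -> dict:
    """Assemble the chat-completions request body (mirrors the curl test)."""
    return {
        "model": LLAMA_MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 64,
        "temperature": 0.4,
    }

def chat(user_msg: str) -> str:
    """POST to the server and return the assistant's reply text."""
    req = urllib.request.Request(
        LLAMA_URL,
        data=json.dumps(build_payload(user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Say a very short sentence in English.")  # needs the server running
```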


5️⃣ Update your prompt generation script (mascot prompts)

We replaced:

  • OLLAMA_URL → with LLAMA_URL pointing to http://xx.xx.xx.xx:3010/v1/chat/completions

  • model: "gpt-oss:20b" → with model: "mistral-small-3.2-24b-ud-q4_k_xl.gguf"


@cosinus now the big question! What explains the 90-95% GPU usage right from the first launch of llama.cpp?

Have a look!

Answering myself ;)



docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 4096 \
    --threads -1 \
    --n-gpu-layers 16 \
    --flash-attn auto

Better by limiting the GPU layers to 16…

Around 27-30% GPU use ;)

You should see this behavior (>90% utilization) on every GPU. Whenever you fire a request, GPU usage goes up to near 100% for as long as your request is being processed. After it finishes, usage drops back to zero - assuming a single user sending requests one by one. GPUs used by multiple users might not drop to zero for a long(er) time. 😅

If you install nvtop (it needs a patched version for Spark [1]), you will see the GPU load spike while a request is running and drop again when it is finished.

GPUs are designed for massive parallelism, meaning their thousands of cores are meant to be used all at once. That's what makes them so fast compared to CPUs: the work is split into many smaller pieces that can be processed in parallel.

[1] NVTOP with DGX Spark unified memory support

On Spark, you always want to set --n-gpu-layers or -ngl (the same parameter) to a large number (999 is a good one), so ALL layers are processed by the GPU. There is no point in offloading any layers to the CPU, as Spark uses a unified memory architecture - you would just lose performance.

In the special spark.Dockerfile, I don't see a “compat” directory in /usr/local/cuda-13/. What should I change this line to?

ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

This is what I am seeing in /usr/local/cuda-13/

Where are you looking?

The directory is in the nvidia/cuda:13.0.2-devel-ubuntu24.04 build container:

docker run --rm -it nvidia/cuda:13.0.2-devel-ubuntu24.04 bash

and inside the container you should see:

root@368fd6ab7832:/# ls -la /usr/local/cuda-13/compat/
total 294916
drwxr-xr-x 2 root root     4096 Oct 10 16:59 .
drwxr-xr-x 1 root root     4096 Oct 10 17:20 ..
lrwxrwxrwx 1 root root       12 Sep 23 15:46 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Sep 23 15:46 libcuda.so.1 -> libcuda.so.580.95.05
-rw-r--r-- 1 root root 96452352 Sep 23 09:58 libcuda.so.580.95.05
lrwxrwxrwx 1 root root       28 Sep 23 15:46 libcudadebugger.so.1 -> libcudadebugger.so.580.95.05
-rw-r--r-- 1 root root  9695008 Sep 23 09:30 libcudadebugger.so.580.95.05
-rw-r--r-- 1 root root 58730832 Sep 23 10:35 libnvidia-gpucomp.so.580.95.05
lrwxrwxrwx 1 root root       19 Sep 23 15:46 libnvidia-nvvm.so -> libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root       27 Sep 23 15:46 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.580.95.05
-rw-r--r-- 1 root root 75890072 Sep 23 10:14 libnvidia-nvvm.so.580.95.05
-rw-r--r-- 1 root root 21476008 Sep 15 10:16 libnvidia-nvvm70.so.4
lrwxrwxrwx 1 root root       37 Sep 23 15:46 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.580.95.05
-rw-r--r-- 1 root root 39731968 Sep 23 09:56 libnvidia-ptxjitcompiler.so.580.95.05

If you use my spark.Dockerfile, it should do its job and produce a ready-to-use container.

I’d like a single image/container that includes both llama‑server and llama‑swap. I tried adding llama‑swap to your image, but the llama‑swap build fails—most likely because I’m using the wrong base image. Do you have any suggestions?

FROM golang:1.22-alpine AS llama_swap_build

RUN apk add --no-cache \
    git \
    build-base \
    nodejs \
    npm

WORKDIR /src/llama-swap

RUN git clone https://github.com/mostlygeek/llama-swap.git . && \
    make clean all NO_UI=1 NO_MAC=1

Normally it is best practice to build one container per service. Then you compose your software stack from those containers, with docker compose being the easiest solution.

You could use the same approach the author of llama-swap did in his Dockerfile.

He uses the llama.cpp container as the base image and copies just the binary into it. I would follow his approach and use the ready-made arm64 binary instead.

Otherwise you would need different build environments/base images if you want to build everything on your own.
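Following that approach, a combined image could look roughly like this. This is an untested sketch: the release tag, the asset file name, and the entrypoint are my assumptions - check the llama-swap releases page for the actual names:

```dockerfile
# Untested sketch: reuse the image built earlier and drop in the prebuilt
# arm64 llama-swap binary, like the upstream llama-swap Dockerfile does.
FROM llama.cpp:server-spark

# Pick a real version from the llama-swap releases page
ARG LLAMA_SWAP_VERSION=<version>

ADD https://github.com/mostlygeek/llama-swap/releases/download/v${LLAMA_SWAP_VERSION}/llama-swap_${LLAMA_SWAP_VERSION}_linux_arm64.tar.gz /tmp/llama-swap.tgz
RUN tar -xzf /tmp/llama-swap.tgz -C /usr/local/bin llama-swap && \
    rm /tmp/llama-swap.tgz

ENTRYPOINT ["/usr/local/bin/llama-swap", "--config", "/app/config.yaml", "--listen", "0.0.0.0:8080"]
```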


I did too. GPT wrote this and it has worked well:

Dockerfile:

# ---------- Global build args (pin versions here) ----------
ARG LLAMA_CPP_TAG=<#>           # https://github.com/ggml-org/llama.cpp/releases
ARG LLAMA_SWAP_TAG=<#>          # https://github.com/mostlygeek/llama-swap/releases
ARG CUDA_TAG=13.0.0-devel-ubuntu22.04

# ---------- Stage 1: build llama.cpp (CUDA) for ARM64 ----------
FROM nvidia/cuda:${CUDA_TAG} AS build

# Re-declare args inside the stage where they’re used
ARG LLAMA_CPP_TAG

RUN apt-get update && apt-get install -y --no-install-recommends \
    git cmake build-essential curl ca-certificates pkg-config libcurl4-openssl-dev && \
    rm -rf /var/lib/apt/lists/*

# Find CUDA target dir on ARM (aarch64-linux or sbsa-linux) and add stub links for libcuda
RUN set -eux; \
  for d in /usr/local/cuda/targets/aarch64-linux /usr/local/cuda/targets/sbsa-linux; do \
    if [ -d "$d/lib" ]; then CUDA_LIB_DIR="$d/lib"; break; fi; \
  done; \
  test -n "${CUDA_LIB_DIR:-}" || (echo "Could not find CUDA targets/*/lib" && exit 1); \
  ln -sf "$CUDA_LIB_DIR/stubs/libcuda.so" "$CUDA_LIB_DIR/libcuda.so"; \
  ln -sf "$CUDA_LIB_DIR/stubs/libcuda.so" "$CUDA_LIB_DIR/libcuda.so.1"

# Clone llama.cpp at the pinned tag
RUN git clone --depth=1 --branch "${LLAMA_CPP_TAG}" https://github.com/ggml-org/llama.cpp /src/llama.cpp
WORKDIR /src/llama.cpp

# Configure CUDA build; build the server; disable tests/examples
RUN set -eux; \
  for d in /usr/local/cuda/targets/aarch64-linux /usr/local/cuda/targets/sbsa-linux; do \
    if [ -d "$d/lib" ]; then CUDA_LIB_DIR="$d/lib"; break; fi; \
  done; \
  cmake -S . -B build \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_LIBRARY_PATH="${CUDA_LIB_DIR}/stubs" \
    -DCMAKE_CUDA_ARCHITECTURES="121a-real" \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath,${CUDA_LIB_DIR} -Wl,-rpath-link,${CUDA_LIB_DIR}/stubs" && \
  cmake --build build -j

# Export artifacts: server + all shared libs produced in build/bin
RUN set -eux; \
  mkdir -p /out/bin /out/lib; \
  install -Dm755 build/bin/llama-server /out/bin/llama-server; \
  find build/bin -maxdepth 1 -type f -name "*.so*" -exec install -Dm755 {} /out/lib/ \; ; \
  ls -l /out/bin /out/lib

# ---------- Stage 2: runtime with CUDA + llama-swap + llama-server ----------
FROM nvidia/cuda:13.0.0-runtime-ubuntu22.04

# Re-declare args used in this stage
ARG LLAMA_SWAP_TAG
ARG SWAP_ARCH=linux_arm64

# Runtime deps (curl for fetching llama-swap; libcurl4 for CURL-enabled server; libgomp1 for OpenMP)
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates curl libgomp1 libcurl4 && \
    rm -rf /var/lib/apt/lists/*

# Fetch llama-swap from the pinned tag; strip leading 'v' for the asset filename
RUN set -eux; \
    ver_no_v="${LLAMA_SWAP_TAG#v}"; \
    curl -L -o /tmp/llama-swap.tgz \
      "https://github.com/mostlygeek/llama-swap/releases/download/${LLAMA_SWAP_TAG}/llama-swap_${ver_no_v}_${SWAP_ARCH}.tar.gz"; \
    tar -xzf /tmp/llama-swap.tgz -C /usr/local/bin llama-swap; \
    chmod +x /usr/local/bin/llama-swap; \
    rm -f /tmp/llama-swap.tgz

# Add llama-server and its shared libs from the build stage
COPY --from=build /out/bin/llama-server /usr/local/bin/llama-server
COPY --from=build /out/lib/ /usr/local/lib/

# Ensure the loader can find both our libs and CUDA's libs
RUN echo "/usr/local/lib" > /etc/ld.so.conf.d/llama.conf && ldconfig
ENV LD_LIBRARY_PATH=/usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# NVIDIA runtime (host mounts real libcuda at run-time)
ENV NVIDIA_VISIBLE_DEVICES=all \
    NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app
EXPOSE 8080

# llama-swap reads /app/config.yaml (mount it there via Compose)
CMD ["llama-swap", "--config", "/app/config.yaml", "--listen", "0.0.0.0:8080"]

And this Compose YAML:

configs:
  llama-swap-config:
    content: |
      healthCheckTimeout: 120
      startPort: 20001

      macros:
        latest-llama: >
          llama-server

      models:
        <fillThisOut>

networks:
  local:
    external: true
    name: local

services:
  llama:
    container_name: llama
    image: llama:1
    networks:
      - local
    ports:
      - "8080:8080"
    restart: unless-stopped
    environment:
      OMP_NUM_THREADS: "4"
      NVIDIA_VISIBLE_DEVICES: "0"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility,graphics"
      # Do not set both to true
      GGML_CUDA_FORCE_MMQ: "1"
      GGML_CUDA_FORCE_CUBLAS: "0"
    volumes:
      - /path/to/models:/models:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    configs:
      - source: llama-swap-config
        target: /app/config.yaml

volumes:
  llama-cache:
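For the `<fillThisOut>` placeholder in the config above, a llama-swap model entry typically looks like the following. The field names are taken from the llama-swap README, but treat the exact schema as an assumption and check the upstream docs; `${PORT}` is substituted by llama-swap starting from `startPort`:

```yaml
models:
  "mistral-small-3.2":
    cmd: >
      llama-server --host 127.0.0.1 --port ${PORT}
      -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf
      --ctx-size 16384 -ngl 999 --flash-attn auto
    proxy: http://127.0.0.1:${PORT}
```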