Building llama.cpp container images for Spark/GB10

Hi!

For those who like to run $THINGS in containers: I tried to find a way to build a Docker image for llama.cpp, as there are currently no container images at all for arm64 - only for amd64. [1]

[1] See the open issue for details: [Tracker] Docker build fails on CI for arm64 · Issue #11888 · ggml-org/llama.cpp · GitHub

So I tried to find out what the normal build process looks like, why it fails for arm64, and how to get it running on our GB10s.

The standard Dockerfiles are located in the .devops folder of the official repo. There is also one for CUDA, but it fails on GB10 out of the box. The main reason is a wrong LD_LIBRARY_PATH, which leads to this error:

#14 92.07 [ 62%] Building CXX object common/CMakeFiles/common.dir/peg-parser.cpp.o
#14 92.23 [ 62%] Building CXX object common/CMakeFiles/common.dir/regex-partial.cpp.o
#14 92.33 [ 62%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
#14 92.38 [ 62%] Linking CXX executable ../../bin/llama-simple
#14 92.44 /usr/bin/ld: warning: libcuda.so.1, needed by ../../bin/libggml-cuda.so.0.9.4, not found (try using -rpath or -rpath-link)
#14 92.45 [ 62%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o

In the devel container used by llama.cpp (nvidia/cuda:13.0.2-devel-ubuntu24.04), the LD_LIBRARY_PATH points to:

root@3e1024dcf4f9:$ env|grep LIBRARY
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64

The needed library is found in /usr/local/cuda-13/compat, so you need to adjust the ENV for that container.

So just add

ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

to the build stage. I also added the CUDA architecture for cmake: if not specified, cmake tries to build for all visible architectures (if I understood the docs correctly), but the normal Docker build process does not have access to the GPU while building. You can change this with BuildKit by defining a builder with GPU support (see Container Device Interface (CDI) | Docker Docs), which is much too complicated here, but may be useful in other projects.
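Put together, the relevant additions look roughly like this. This is only a sketch of the two changes, not the full file (see the linked spark.Dockerfile for the working version); the architecture value 121 (GB10's compute capability 12.1) is my assumption - verify it for your board:

```dockerfile
# Sketch of the two changes relative to the stock CUDA Dockerfile
# (see the linked spark.Dockerfile for the full, working version):
FROM nvidia/cuda:13.0.2-devel-ubuntu24.04 AS build

# libcuda.so.1 lives in the compat directory of the devel image
ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

# Pin the GPU architecture so cmake does not need a visible GPU at build
# time (12.1 is GB10's compute capability; assumption - verify on your box)
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 && \
    cmake --build build --config Release -j
```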

The modified Dockerfile can be found here: llama.cpp Dockerfile for DGX Spark / GB10 · GitHub

So here are all the steps to build a server container image:

mkdir src
cd src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp/.devops
wget https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile
cd ..
docker build -f .devops/spark.Dockerfile --target server -t llama.cpp:server-spark .

Hope that saves others some time when trying to build $THINGS that are built similarly.

Maybe I can push that upstream, so the llama.cpp team will integrate it into their build process.

Feedback welcome.


@cosinus thanks a lot!

Here are the instructions!

1️⃣ Build a GB10-compatible llama.cpp Docker image

Goal: have a llama.cpp:server-spark image that works correctly on your NVIDIA GB10 (arm64).

# 1. Create a working directory
mkdir -p ~/src
cd ~/src

# 2. Clone the official llama.cpp repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# 3. Go into the .devops folder
cd .devops

# 4. Download the special Dockerfile for DGX Spark / GB10
wget "https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile"

# 5. Go back to repo root
cd ..

Then build the server image with CUDA + GB10 fix:

docker build \
  -f .devops/spark.Dockerfile \
  --target server \
  -t llama.cpp:server-spark .

This image:

  • Uses nvidia/cuda:13.0.2-devel-ubuntu24.04 as base

  • Fixes LD_LIBRARY_PATH to include /usr/local/cuda-13/compat

  • Builds llama-server for arm64 + GB10.


2️⃣ Download your model (Mistral Small 3.2 24B GGUF)

We chose:

unsloth/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf

On your Spark:

mkdir -p /home/user/models
cd /home/user/models

wget "https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/resolve/main/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf" \
  -O mistral-small-3.2-24b-ud-q4_k_xl.gguf

ls -lh /home/user/models

You should see the GGUF file in that folder.
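Beyond listing the file, you can sanity-check the download itself: the GGUF format starts with the ASCII magic bytes `GGUF`, so reading the first four bytes tells you whether you got a model file or, say, an HTML error page. The helper below is just a sketch, reusing the path from above:

```shell
# Sanity-check a GGUF download: a valid file starts with the ASCII magic "GGUF".
check_gguf() {
  if [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]; then
    echo "OK: $1"
  else
    echo "NOT a GGUF file: $1"
  fi
}

check_gguf /home/user/models/mistral-small-3.2-24b-ud-q4_k_xl.gguf
```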


3️⃣ Run llama.cpp server container on port 3010

We then started a container using the image you built and the model you downloaded:

docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 16384 \
    --threads -1 \
    --n-gpu-layers 99 \
    --flash-attn auto

Check it’s running:

docker ps | grep llama

You should see llama-spark-mistral32 in the Up state.


4️⃣ Test the HTTP API on the Spark

Still on the Spark:

curl -s http://localhost:3010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-3.2-24b-ud-q4_k_xl.gguf",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Say a very short sentence in English." }
    ],
    "max_tokens": 64,
    "temperature": 0.4
  }'

From your Mac, you then call the same endpoint using the Spark IP:

LLAMA_URL = "http://xx.xx.xx.xx:3010/v1/chat/completions"
LLAMA_MODEL = "mistral-small-3.2-24b-ud-q4_k_xl.gguf"
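Those two constants suggest a Python script; a minimal stdlib-only client for the same endpoint could look like this. It is a sketch: the host IP stays a placeholder, and the helper names are mine, not from the original prompt-generation script:

```python
# Minimal stdlib-only client for the llama.cpp OpenAI-compatible endpoint.
# The IP placeholder and the helper names are illustrative, not from the
# original prompt-generation script.
import json
import urllib.request

LLAMA_URL = "http://xx.xx.xx.xx:3010/v1/chat/completions"
LLAMA_MODEL = "mistral-small-3.2-24b-ud-q4_k_xl.gguf"

def build_payload(user_msg: str) -> dict:
    """Assemble the chat-completions request body (mirrors the curl test)."""
    return {
        "model": LLAMA_MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_msg},
        ],
        "max_tokens": 64,
        "temperature": 0.4,
    }

def chat(user_msg: str) -> str:
    """POST to the server and return the assistant's reply text."""
    req = urllib.request.Request(
        LLAMA_URL,
        data=json.dumps(build_payload(user_msg)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Say a very short sentence in English.")  # needs the server running
```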


5️⃣ Update your prompt generation script (mascot prompts)

We replaced:

  • OLLAMA_URL → with LLAMA_URL pointing to http://xx.xx.xx.xx:3010/v1/chat/completions

  • model: "gpt-oss:20b" → with model: "mistral-small-3.2-24b-ud-q4_k_xl.gguf"


@cosinus now the big question! What explains the 90-95% GPU usage right from the first launch of llama.cpp?

Have a look!

Answering myself ;)



docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 4096 \
    --threads -1 \
    --n-gpu-layers 16 \
    --flash-attn auto

Better by limiting the GPU layers to 16…

Around 27-30% GPU use ;)

You should see this behavior (>90% utilization) on every GPU. Whenever you fire a request, GPU usage goes up to near 100% for as long as your request is being processed. After it finishes, usage drops back to zero - assuming a single user sending requests one by one. GPUs used by multiple users might not drop to zero for a long(er) time. 😅

If you install nvtop (it needs a patched version for Spark [1]), you will see the GPU load spike while a request is running and drop again when it is finished.

GPUs are designed for massive parallelism, meaning their thousands of cores are meant to be used all at once. That's what makes them so fast compared to CPUs: the work is split into many smaller pieces that can be processed in parallel.

[1] NVTOP with DGX Spark unified memory support

On Spark, you always want to set --n-gpu-layers or -ngl (the same parameter) to a large number (999 is a good one), so ALL layers are processed by the GPU. There is no point in offloading any layers to the CPU, as Spark uses a unified memory architecture - you would just lose performance.

In the special spark.Dockerfile, I don't see a “compat” directory in /usr/local/cuda-13/. What should I change this line to?

ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

This is what I am seeing in /usr/local/cuda-13/

Where are you looking?

The directory is in the nvidia/cuda:13.0.2-devel-ubuntu24.04 build container:

docker run --rm -it nvidia/cuda:13.0.2-devel-ubuntu24.04 bash

and inside the container you should see:

root@368fd6ab7832:/# ls -la /usr/local/cuda-13/compat/
total 294916
drwxr-xr-x 2 root root     4096 Oct 10 16:59 .
drwxr-xr-x 1 root root     4096 Oct 10 17:20 ..
lrwxrwxrwx 1 root root       12 Sep 23 15:46 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Sep 23 15:46 libcuda.so.1 -> libcuda.so.580.95.05
-rw-r--r-- 1 root root 96452352 Sep 23 09:58 libcuda.so.580.95.05
lrwxrwxrwx 1 root root       28 Sep 23 15:46 libcudadebugger.so.1 -> libcudadebugger.so.580.95.05
-rw-r--r-- 1 root root  9695008 Sep 23 09:30 libcudadebugger.so.580.95.05
-rw-r--r-- 1 root root 58730832 Sep 23 10:35 libnvidia-gpucomp.so.580.95.05
lrwxrwxrwx 1 root root       19 Sep 23 15:46 libnvidia-nvvm.so -> libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root       27 Sep 23 15:46 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.580.95.05
-rw-r--r-- 1 root root 75890072 Sep 23 10:14 libnvidia-nvvm.so.580.95.05
-rw-r--r-- 1 root root 21476008 Sep 15 10:16 libnvidia-nvvm70.so.4
lrwxrwxrwx 1 root root       37 Sep 23 15:46 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.580.95.05
-rw-r--r-- 1 root root 39731968 Sep 23 09:56 libnvidia-ptxjitcompiler.so.580.95.05

If you use my spark.Dockerfile, it should do its job and produce a ready-to-use container.

I’d like a single image/container that includes both llama‑server and llama‑swap. I tried adding llama‑swap to your image, but the llama‑swap build fails—most likely because I’m using the wrong base image. Do you have any suggestions?

FROM golang:1.22-alpine AS llama_swap_build

RUN apk add --no-cache \
    git \
    build-base \
    nodejs \
    npm

WORKDIR /src/llama-swap

RUN git clone https://github.com/mostlygeek/llama-swap.git . && \
    make clean all NO_UI=1 NO_MAC=1

Normally it is best practice to build one container per service. Then you compose your software stack from those containers, with docker compose being the easiest solution.

You could use the same approach the author of llama-swap did in his Dockerfile.

He uses the llama.cpp container as the base image and copies just the binary into it. I would follow his approach and use the ready-made arm64 binary instead.

Otherwise you would need different build environments/base images if you want to build everything on your own.
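Following that approach, a combined image could look roughly like this. This is an untested sketch: the release tag, the asset file name, and the entrypoint are my assumptions - check the llama-swap releases page for the actual names:

```dockerfile
# Untested sketch: reuse the image built earlier and drop in the prebuilt
# arm64 llama-swap binary, like the upstream llama-swap Dockerfile does.
FROM llama.cpp:server-spark

# Pick a real version from the llama-swap releases page
ARG LLAMA_SWAP_VERSION=<version>

ADD https://github.com/mostlygeek/llama-swap/releases/download/v${LLAMA_SWAP_VERSION}/llama-swap_${LLAMA_SWAP_VERSION}_linux_arm64.tar.gz /tmp/llama-swap.tgz
RUN tar -xzf /tmp/llama-swap.tgz -C /usr/local/bin llama-swap && \
    rm /tmp/llama-swap.tgz

ENTRYPOINT ["/usr/local/bin/llama-swap", "--config", "/app/config.yaml", "--listen", "0.0.0.0:8080"]
```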


I did too. GPT wrote this and it has worked well:

Dockerfile:

# ---------- Global build args (pin versions here) ----------
ARG LLAMA_CPP_TAG=<#>           # https://github.com/ggml-org/llama.cpp/releases
ARG LLAMA_SWAP_TAG=<#>          # https://github.com/mostlygeek/llama-swap/releases
ARG CUDA_TAG=13.0.0-devel-ubuntu22.04

# ---------- Stage 1: build llama.cpp (CUDA) for ARM64 ----------
FROM nvidia/cuda:${CUDA_TAG} AS build

# Re-declare args inside the stage where they’re used
ARG LLAMA_CPP_TAG

RUN apt-get update && apt-get install -y --no-install-recommends \
    git cmake build-essential curl ca-certificates pkg-config libcurl4-openssl-dev && \
    rm -rf /var/lib/apt/lists/*

# Find CUDA target dir on ARM (aarch64-linux or sbsa-linux) and add stub links for libcuda
RUN set -eux; \
  for d in /usr/local/cuda/targets/aarch64-linux /usr/local/cuda/targets/sbsa-linux; do \
    if [ -d "$d/lib" ]; then CUDA_LIB_DIR="$d/lib"; break; fi; \
  done; \
  test -n "${CUDA_LIB_DIR:-}" || (echo "Could not find CUDA targets/*/lib" && exit 1); \
  ln -sf "$CUDA_LIB_DIR/stubs/libcuda.so" "$CUDA_LIB_DIR/libcuda.so"; \
  ln -sf "$CUDA_LIB_DIR/stubs/libcuda.so" "$CUDA_LIB_DIR/libcuda.so.1"

# Clone llama.cpp at the pinned tag
RUN git clone --depth=1 --branch "${LLAMA_CPP_TAG}" https://github.com/ggml-org/llama.cpp /src/llama.cpp
WORKDIR /src/llama.cpp

# Configure CUDA build; build the server; disable tests/examples
RUN set -eux; \
  for d in /usr/local/cuda/targets/aarch64-linux /usr/local/cuda/targets/sbsa-linux; do \
    if [ -d "$d/lib" ]; then CUDA_LIB_DIR="$d/lib"; break; fi; \
  done; \
  cmake -S . -B build \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=OFF \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_LIBRARY_PATH="${CUDA_LIB_DIR}/stubs" \
    -DCMAKE_CUDA_ARCHITECTURES="121a-real" \
    -DCMAKE_EXE_LINKER_FLAGS="-Wl,-rpath,${CUDA_LIB_DIR} -Wl,-rpath-link,${CUDA_LIB_DIR}/stubs" && \
  cmake --build build -j

# Export artifacts: server + all shared libs produced in build/bin
RUN set -eux; \
  mkdir -p /out/bin /out/lib; \
  install -Dm755 build/bin/llama-server /out/bin/llama-server; \
  find build/bin -maxdepth 1 -type f -name "*.so*" -exec install -Dm755 {} /out/lib/ \; ; \
  ls -l /out/bin /out/lib

# ---------- Stage 2: runtime with CUDA + llama-swap + llama-server ----------
FROM nvidia/cuda:13.0.0-runtime-ubuntu22.04

# Re-declare args used in this stage
ARG LLAMA_SWAP_TAG
ARG SWAP_ARCH=linux_arm64

# Runtime deps (curl for fetching llama-swap; libcurl4 for CURL-enabled server; libgomp1 for OpenMP)
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates curl libgomp1 libcurl4 && \
    rm -rf /var/lib/apt/lists/*

# Fetch llama-swap from the pinned tag; strip leading 'v' for the asset filename
RUN set -eux; \
    ver_no_v="${LLAMA_SWAP_TAG#v}"; \
    curl -L -o /tmp/llama-swap.tgz \
      "https://github.com/mostlygeek/llama-swap/releases/download/${LLAMA_SWAP_TAG}/llama-swap_${ver_no_v}_${SWAP_ARCH}.tar.gz"; \
    tar -xzf /tmp/llama-swap.tgz -C /usr/local/bin llama-swap; \
    chmod +x /usr/local/bin/llama-swap; \
    rm -f /tmp/llama-swap.tgz

# Add llama-server and its shared libs from the build stage
COPY --from=build /out/bin/llama-server /usr/local/bin/llama-server
COPY --from=build /out/lib/ /usr/local/lib/

# Ensure the loader can find both our libs and CUDA's libs
RUN echo "/usr/local/lib" > /etc/ld.so.conf.d/llama.conf && ldconfig
ENV LD_LIBRARY_PATH=/usr/local/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# NVIDIA runtime (host mounts real libcuda at run-time)
ENV NVIDIA_VISIBLE_DEVICES=all \
    NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app
EXPOSE 8080

# llama-swap reads /app/config.yaml (mount it there via Compose)
CMD ["llama-swap", "--config", "/app/config.yaml", "--listen", "0.0.0.0:8080"]

And this Compose YAML:

configs:
  llama-swap-config:
    content: |
      healthCheckTimeout: 120
      startPort: 20001

      macros:
        latest-llama: >
          llama-server

      models:
        <fillThisOut>

networks:
  local:
    external: true
    name: local

services:
  llama:
    container_name: llama
    image: llama:1
    networks:
      - local
    ports:
      - "8080:8080"
    restart: unless-stopped
    environment:
      OMP_NUM_THREADS: "4"
      NVIDIA_VISIBLE_DEVICES: "0"
      NVIDIA_DRIVER_CAPABILITIES: "compute,utility,graphics"
      # Do not set both to true
      GGML_CUDA_FORCE_MMQ: "1"
      GGML_CUDA_FORCE_CUBLAS: "0"
    volumes:
      - /path/to/models:/models:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    configs:
      - source: llama-swap-config
        target: /app/config.yaml

volumes:
  llama-cache:
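For the `<fillThisOut>` placeholder in the config above, a llama-swap model entry typically looks like the following. The field names are taken from the llama-swap README, but treat the exact schema as an assumption and check the upstream docs; `${PORT}` is substituted by llama-swap starting from `startPort`:

```yaml
models:
  "mistral-small-3.2":
    cmd: >
      llama-server --host 127.0.0.1 --port ${PORT}
      -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf
      --ctx-size 16384 -ngl 999 --flash-attn auto
    proxy: http://127.0.0.1:${PORT}
```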