Gemma 4 31B on DGX Spark via NemoClaw — Dual-Model Setup Guide

Overview

This guide documents how to run Google Gemma 4 31B IT alongside NVIDIA Nemotron 3 Super 120B on a DGX Spark, switching between them at runtime via a single shell command. Chief (the OpenClaw agent) runs inside a NemoClaw sandbox and communicates with models through OpenShell’s inference.local proxy.

This setup is notable because:

  • No one has publicly documented running dual local models through NemoClaw’s inference routing
  • The documented openshell inference set provider names (compatible-endpoint, vllm-local) do not work in openshell 0.0.21 — we found an undocumented openshell provider create workaround via a GitHub issue
  • Gemma 4’s thinking mode breaks OpenClaw agent responses — the fix required server-side template overrides, not config changes
  • The entire model swap (config + routing + restart) is automated in a single script

Architecture

┌────────────────────────────────────────────────────┐
│  DGX Spark (128GB unified memory)                  │
│                                                    │
│  ┌──────────────────────┐                          │
│  │ NemoClaw Sandbox     │                          │
│  │ (OpenClaw 2026.4.9)  │                          │
│  │                      │                          │
│  │ inference.local ─────┼──► OpenShell Proxy       │
│  └──────────────────────┘         │                │
│                                   │                │
│                          ┌────────┴────────┐       │
│                          │                 │       │
│                     nvidia-prod        llama-cpp   │
│                          │                 │       │
│                     NIM (local)     localhost:8000 │
│                     Nemotron 3      llama-server   │
│                     Super 120B      (Gemma 4)      │
│                    (on Blackwell)                  │
└────────────────────────────────────────────────────┘

Hardware & Software

| Component | Detail |
|---|---|
| Hardware | DGX Spark, GB10 Grace Blackwell, 128GB LPDDR5X |
| OS | Ubuntu 24.04.4 LTS (aarch64) |
| CUDA | 13.0 (V13.0.88) |
| openshell CLI | 0.0.21 |
| Gateway image | openshell/cluster (GitHub package) |
| NemoClaw CLI | v0.1.0 |
| OpenClaw | 2026.4.9 |
| llama.cpp | Built from source (sm_121) |
| Model (NIM) | nvidia/nemotron-3-super-120b-a12b via NIM (local inference on Blackwell GPU) |
| Model (llama.cpp) | gemma-4-31B-it-f16.gguf (62GB, F16) via llama-server (local inference on Blackwell GPU) |

Memory Budget

Measured Memory Usage (April 9, 2026)

With NemoClaw infrastructure + llama-server (Gemma 4 31B F16) running simultaneously:

$ free -h
               total        used        free      shared  buff/cache   available
Mem:           121Gi       108Gi       1.0Gi       263Mi        12Gi        13Gi
Swap:           15Gi       8.4Gi       7.6Gi

$ nvidia-smi (GPU processes)
  llama-server                    64107MiB
  Xorg                              268MiB
  gnome-shell                       230MiB
  firefox                          1047MiB
  telegram-desktop                   87MiB

Key observations:

  • Total system memory: 121GB usable of 128GB (unified CPU/GPU)
  • llama-server alone: ~64GB (model weights + KV cache for 65K context)
  • NemoClaw/OpenShell containers + OS + desktop: ~44GB
  • Free memory with Gemma loaded: ~1GB free, ~13GB available (buff/cache)
  • System is actively swapping 8.4GB to disk
  • GPU utilization: 6% idle, spikes during inference

Memory Budget Summary

| Component | Approximate Memory |
|---|---|
| NemoClaw/OpenShell + k3s + Docker + NIM (Nemotron) | ~40 GB |
| OS + Desktop (Xorg, GNOME, Firefox, Telegram) | ~4 GB |
| Gemma 4 31B F16 model weights | ~62 GB |
| KV cache (65K context, q8_0) | ~2-3 GB |
| Total when both models active | ~108 GB of 121 GB usable |
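The totals above are simple arithmetic. A quick sanity check against the measured numbers (component figures taken from the table; the KV cache uses the midpoint of the ~2-3 GB estimate):

```python
# Sanity-check the memory budget against the measured ~108 GB "used" figure.
budget_gb = {
    "nemoclaw_openshell_k3s_docker_nim": 40,
    "os_and_desktop": 4,
    "gemma_f16_weights": 62,
    "kv_cache_65k_q8_0": 2.5,  # midpoint of the ~2-3 GB estimate
}
usable = 121  # GB usable, as reported by `free -h`
total = sum(budget_gb.values())
print(f"total ≈ {total} GB of {usable} GB usable")
```

This lands at ~108.5 GB, matching the ~108 GB that `free -h` reports as used.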

⚠️ Memory Warning

Running Gemma 4 31B at F16 precision pushes the DGX Spark close to its limits (89% utilization, active swap). The system works but:

  • Stop llama-server when not in use (Ctrl+C) to free ~64GB
  • Close Firefox and unnecessary desktop apps to recover ~1-2GB
  • Monitor swap usage — heavy swapping degrades inference speed

Lighter Alternative: Q4_K_M Quantization

If you want Gemma 4 to coexist comfortably with the full NemoClaw stack without swapping:

# Download quantized model (~18GB instead of 62GB)
source ~/llama-cpp-venv/bin/activate
hf download ggml-org/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-Q4_K_M.gguf \
  --local-dir ~/models/gemma-4-31B-it-GGUF

# Start with quantized model
cd ~/llama.cpp/build
./bin/llama-server \
  --model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 --ctx-size 65536 --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking":false}'

No config or script changes needed — the model ID and OpenClaw config stay the same. Only the --model path changes.

| Model Variant | Size | Memory Used | Swap? | Quality |
|---|---|---|---|---|
| F16 (current) | 62GB | ~64GB | Yes, 8.4GB | Maximum |
| Q4_K_M | ~18GB | ~20GB | No | Good (slight degradation) |
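The size difference follows directly from bits per weight. A rough estimate (the ~4.8 bits/weight average for Q4_K_M is an approximation, since the quant mixes tensor types):

```python
# Rough GGUF file-size estimate from parameter count and bits per weight.
params = 31e9                     # Gemma 4 31B
f16_gb = params * 16 / 8 / 1e9    # 16 bits per weight
q4_gb = params * 4.8 / 8 / 1e9    # Q4_K_M averages roughly 4.8 bits/weight
print(f"F16:    ~{f16_gb:.0f} GB")   # ~62 GB, matching the download size
print(f"Q4_K_M: ~{q4_gb:.0f} GB")    # close to the ~18GB file
```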

Also Consider: Gemma 4 26B-A4B (MoE)

The 26B MoE variant activates only 3.8B parameters per forward pass:

  • Faster inference (45-60 tok/s reported vs ~3.5 tok/s for 31B dense)
  • Lower memory (~28GB at F16, ~10GB at Q4)
  • Community reports better stability on DGX Spark
  • Separate GGUF download required

Part 1: Building llama.cpp

Prerequisites

git --version    # 2.43.0+
cmake --version  # 3.28+
nvcc --version   # CUDA 13.0+

Install HuggingFace CLI

python3 -m venv ~/llama-cpp-venv
source ~/llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
hf version

Clone and Build

git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8

Build takes ~5-10 minutes. Binary is at ~/llama.cpp/build/bin/llama-server.

Download the Model

source ~/llama-cpp-venv/bin/activate
hf download ggml-org/gemma-4-31B-it-GGUF \
  gemma-4-31B-it-f16.gguf \
  --local-dir ~/models/gemma-4-31B-it-GGUF

~62GB download. On DGX Spark with fast network, completed in ~2 minutes at 546MB/s. Resumable if interrupted.
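The quoted download time checks out against the throughput:

```python
# 62 GB at the observed 546 MB/s sustained throughput (decimal units).
size_mb = 62 * 1000
seconds = size_mb / 546
print(f"~{seconds:.0f} s ≈ {seconds / 60:.1f} min")  # ~114 s, i.e. ~2 minutes
```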


Part 2: Starting the llama-server

The Critical Flags

cd ~/llama.cpp/build
./bin/llama-server \
  --model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 --ctx-size 65536 --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking":false}'

Wait for: main: server is listening on http://0.0.0.0:8000

Parameter Explanations

| Parameter | Value | Why |
|---|---|---|
| --host 0.0.0.0 | Bind all interfaces | Required — OpenShell’s Docker bridge network needs to reach the server |
| --port 8000 | Port 8000 | Matches the OpenShell provider config (default vLLM port) |
| --n-gpu-layers 99 | Offload all layers to GPU | DGX Spark has enough VRAM |
| --ctx-size 65536 | 64K context window | OpenClaw’s full agent prompt (workspace files + system prompt) uses ~8000 tokens; 8192 is too small and caused context overflow. 65536 provides comfortable headroom |
| --threads 8 | 8 CPU threads | For non-GPU work |
| --flash-attn on | Enable flash attention | Performance optimization (requires the explicit on value; the bare flag errors) |
| --cache-type-k q8_0 | KV cache quantization (keys) | Reduces memory for large context windows while maintaining quality |
| --cache-type-v q8_0 | KV cache quantization (values) | Same as above for values |
| --jinja | Enable Jinja templates | Required for --chat-template-kwargs to work |
| --chat-template-kwargs '{"enable_thinking":false}' | Disable thinking mode | THE CRITICAL FIX — see below |

Why Thinking Mode Must Be Disabled

Gemma 4 31B IT has thinking/reasoning mode enabled by default in its chat template. When active:

  1. The model puts all output in reasoning_content field instead of content
  2. The content field is returned as an empty string ""
  3. OpenClaw reads content, gets nothing, and returns “No reply from agent”
  4. The think: false parameter in the API request works (tested via curl), but OpenClaw does not send it
  5. Server-level --chat-template-kwargs '{"enable_thinking":false}' is the only reliable fix

Discovery process:

  • Without the flag: content: "", reasoning_content: "The answer is..." → OpenClaw sees empty response
  • With "think": false in curl request: content: "The answer is 4" → Works, but OpenClaw doesn’t send this
  • With --chat-template-kwargs '{"enable_thinking":false}': content: "The answer is 4", reasoning_content: "NONE" → Works for all requests

Note from Google docs: Larger Gemma 4 models (26B, 31B) may occasionally emit thought channels even when thinking is disabled. The --jinja template override is more reliable than request-level parameters.
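If restarting the server is ever not an option, the empty-content failure mode can be papered over on the client side. The helper below is purely illustrative (OpenClaw has no such hook; the function name is made up):

```python
# Hypothetical client-side fallback: read reasoning_content when content
# comes back empty. Mirrors the failure mode described above.
def extract_reply(message: dict) -> str:
    content = message.get("content") or ""
    if content.strip():
        return content
    # Thinking mode put the output in reasoning_content and left content "".
    return message.get("reasoning_content") or ""

msg = {"content": "", "reasoning_content": "The answer is 4"}
print(extract_reply(msg))  # → The answer is 4
```

The server-side template override remains the right fix; this only shows why the agent sees "nothing" when thinking mode is on.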

Verify the Server (Outside Sandbox)

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":100}' \
  | python3 -c "import sys,json; d=json.load(sys.stdin); \
    print('content:', repr(d['choices'][0]['message'].get('content'))); \
    print('reasoning:', repr(d['choices'][0]['message'].get('reasoning_content','NONE')))"

Expected: content: '2 + 2 = 4', reasoning: 'NONE'


Part 3: OpenShell Provider Registration

The Problem

The NemoClaw sandbox cannot reach localhost:8000 directly — the sandbox runs in an isolated network namespace. All inference traffic must go through OpenShell’s inference.local proxy.

What Did NOT Work

The documented openshell inference set provider names are not implemented in openshell 0.0.21:

# ALL OF THESE FAIL with "provider not found":
openshell inference set --provider compatible-endpoint --model gemma-4-31B-it
openshell inference set --provider vllm-local --model gemma-4-31B-it
openshell inference set --provider vllm --model gemma-4-31B-it
openshell inference set --provider local --model gemma-4-31B-it
openshell inference set --provider custom --model gemma-4-31B-it
openshell inference set --provider openai-compatible --model gemma-4-31B-it
openshell inference set --provider "Local Inference" --model gemma-4-31B-it
openshell inference set --provider "OpenAI-Compatible" --model gemma-4-31B-it

Bug report filed on NemoClaw GitHub.

What DOES Work: openshell provider create

Found via NemoClaw GitHub issue #893. The key is host.openshell.internal — the DNS name the sandbox uses to reach the host machine.

openshell provider create --name llama-cpp \
  --type openai \
  --credential "OPENAI_API_KEY=unused" \
  --config "OPENAI_BASE_URL=http://host.openshell.internal:8000/v1"

This registers llama-cpp as a named provider. It persists across restarts — only needs to be run once.

Switching Inference

# Switch to Gemma
openshell inference set --provider llama-cpp --model gemma-4-31B-it --no-verify

# Switch to Nemotron
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b --no-verify

Verify from Inside Sandbox

nemoclaw my-assistant connect
curl -s https://inference.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma-4-31B-it","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":50}' \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['choices'][0]['message'].get('content'))"
exit

Part 4: OpenClaw Configuration (What Changed and What Didn’t)

The Key Constraint

OpenClaw inside the sandbox talks to https://inference.local/v1 — a single endpoint proxied by OpenShell. You cannot have two providers simultaneously in openclaw.json. Switching models requires updating both the OpenShell routing AND the OpenClaw config.

What Changed Between Configs

| Setting | Nemotron Config | Gemma Config |
|---|---|---|
| models.providers.inference.api | openai-responses | openai-completions |
| models.providers.inference.models[0].id | nvidia/nemotron-3-super-120b-a12b | gemma-4-31B-it |
| models.providers.inference.models[0].name | inference/nvidia/nemotron-3-super-120b-a12b | inference/gemma-4-31B-it |
| models.providers.inference.models[0].contextWindow | 131072 | 65536 |
| models.providers.inference.models[0].compat | (not present) | {"requiresStringContent": true} |
| agents.defaults.model.primary | inference/nvidia/nemotron-3-super-120b-a12b | inference/gemma-4-31B-it |

What Did NOT Change

These stay identical across both configs:

  • baseUrl: https://inference.local/v1 (always through OpenShell proxy)
  • apiKey: unused
  • env.TAVILY_API_KEY
  • plugins.entries.tavily
  • gateway settings (token, auth, controlUi)
  • commands settings
  • channels.defaults
  • All other agent defaults

The compat Flag That Matters

"compat": {
  "requiresStringContent": true
}

requiresStringContent: true — Required. llama-server’s OpenAI-compatible endpoint expects string content, not structured content-part arrays. Without this, OpenClaw sends [{"type": "text", "text": "..."}] which causes invalid type: sequence, expected a string errors.
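What the flag changes on the wire can be sketched as follows. The flattening function is illustrative only, not OpenClaw's actual code:

```python
# Without requiresStringContent, message content is sent as a content-part
# array, which llama-server's endpoint rejects. The compat flag makes
# OpenClaw send the flattened string form instead. Illustrative sketch.
def flatten_content(content) -> str:
    if isinstance(content, str):
        return content  # already the shape llama-server expects
    # [{"type": "text", "text": "..."}] -> "..."
    return "".join(part.get("text", "") for part in content)

structured = [{"type": "text", "text": "What is 2+2?"}]
print(flatten_content(structured))  # → What is 2+2?
```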

supportsTools: false — Do NOT use. It was initially added based on inference-provider documentation, but this flag causes OpenClaw to suppress valid agent responses. Removing it fixed the “No reply from agent” issue for complex questions.

The Configs

Both configs are stored on the host (outside Docker) at:

  • ~/.nemoclaw/configs/openclaw-nemotron.json
  • ~/.nemoclaw/configs/openclaw-gemma.json

The gateway regenerates the agent-level models.json from openclaw.json on every restart, so manual edits to ~/.openclaw/agents/main/agent/models.json inside the sandbox are wiped. Always edit the master config via docker cp.


Part 5: The Switch Script

Location

~/.nemoclaw/configs/switch-model.sh

Script

#!/bin/bash
set -euo pipefail

CONFIGS_DIR="$HOME/.nemoclaw/configs"
CONTAINER="openshell-cluster-nemoclaw"

if [[ "${1:-}" == "gemma" ]]; then
  CONFIG="$CONFIGS_DIR/openclaw-gemma.json"
  PROVIDER="llama-cpp"
  MODEL="gemma-4-31B-it"
  echo "Switching to Gemma 4 31B..."
elif [[ "${1:-}" == "nemotron" ]]; then
  CONFIG="$CONFIGS_DIR/openclaw-nemotron.json"
  PROVIDER="nvidia-prod"
  MODEL="nvidia/nemotron-3-super-120b-a12b"
  echo "Switching to Nemotron 3 Super 120B..."
else
  echo "Usage: bash switch-model.sh [gemma|nemotron]"
  exit 1
fi

# Switch inference provider
openshell inference set --provider "$PROVIDER" --model "$MODEL" --no-verify

# Update openclaw.json inside sandbox
docker exec -u root "$CONTAINER" sh -c 'ROOTFS=$(find /run/k3s/containerd -name "openclaw.json" -path "*/rootfs/*" 2>/dev/null | head -1) && mv $ROOTFS ${ROOTFS}.old'
docker cp "$CONFIG" "$CONTAINER":/tmp/openclaw-new.json
docker exec -u root "$CONTAINER" sh -c 'ROOTFS=$(find /run/k3s/containerd -name "openclaw.json.old" -path "*/rootfs/*" 2>/dev/null | head -1 | sed "s|.old||") && cp /tmp/openclaw-new.json $ROOTFS'

# Restart services
nemoclaw stop
nemoclaw start

echo "Done. Use /new on Telegram to start a fresh session."

Usage

# Prerequisites: llama-server must be running for Gemma
bash ~/.nemoclaw/configs/switch-model.sh gemma

# Switch back to Nemotron via NIM (no llama-server needed)
bash ~/.nemoclaw/configs/switch-model.sh nemotron

After switching, always use /new on Telegram to start a fresh session.
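To confirm which model a saved config points at after a switch, you can read the primary model straight out of the JSON. A sketch, assuming the key layout documented in Part 4:

```python
# Check which model an OpenClaw config selects.
# Key path follows the config layout described in Part 4.
import json
import os

def primary_model(config_path: str) -> str:
    with open(os.path.expanduser(config_path)) as f:
        cfg = json.load(f)
    return cfg["agents"]["defaults"]["model"]["primary"]

# e.g. primary_model("~/.nemoclaw/configs/openclaw-gemma.json")
# should return "inference/gemma-4-31B-it" for the Gemma config
```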


Part 6: Complete Workflow

Starting Gemma 4 from Scratch

# 1. SSH to DGX Spark
ssh marcopapa@spark-dcce.taila3bbce.ts.net

# 2. Start llama-server (in one terminal, keep open)
cd ~/llama.cpp/build
./bin/llama-server \
  --model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 99 --ctx-size 65536 --threads 8 \
  --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking":false}'

# 3. In a second terminal, switch to Gemma
bash ~/.nemoclaw/configs/switch-model.sh gemma

# 4. On Telegram: /new, then chat with Gemma 4

Switching Back to Nemotron

# 1. Switch inference routing + config
bash ~/.nemoclaw/configs/switch-model.sh nemotron

# 2. (Optional) Stop llama-server to free GPU memory
# Ctrl+C in the llama-server terminal

# 3. On Telegram: /new, then chat with Nemotron

Part 7: Troubleshooting

“No reply from agent”

| Cause | Fix |
|---|---|
| Thinking mode enabled | Restart llama-server with --jinja --chat-template-kwargs '{"enable_thinking":false}' |
| supportsTools: false in compat | Remove it — keep only requiresStringContent: true |
| Context overflow (ctx-size too small) | Increase --ctx-size (8192 is too small, 65536 works) |
| openclaw.json not updated | Run switch-model.sh or verify with docker exec cat |
| Stale Telegram session | Use /new to start fresh |

“Context overflow: prompt too large”

The full OpenClaw agent prompt (SOUL.md, USER.md, AGENTS.md, IDENTITY.md, HEARTBEAT.md, TOOLS.md + system prompt) uses ~8000 tokens. The --ctx-size must be significantly larger than this to leave room for conversation.

| ctx-size | Status |
|---|---|
| 8192 | ❌ Too small — context overflow |
| 32768 | ⚠️ Minimum viable |
| 65536 | ✅ Recommended |
| 131072 | ✅ Maximum headroom (more memory) |
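The headroom arithmetic behind those recommendations:

```python
# Conversation headroom left after the ~8000-token agent prompt.
agent_prompt = 8000
for ctx in (8192, 32768, 65536, 131072):
    headroom = ctx - agent_prompt
    status = "overflow risk" if headroom < 1000 else "ok"
    print(f"ctx {ctx:>6}: {headroom:>6} tokens for conversation ({status})")
```

At 8192 only 192 tokens remain for the conversation, which is why the overflow shows up almost immediately.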

Server Shows 200 But No Reply

The server processed the request successfully but OpenClaw dropped the response. Check:

  1. Is content empty in the response? (thinking mode issue)
  2. Is compat.supportsTools set to false? (remove it)
  3. Is the model ID in openclaw.json matching what the server expects?

Provider Not Found

openshell inference set --provider <name> only works with registered providers. In openshell 0.0.21, use openshell provider create to register custom providers first.


Part 8: Performance Notes (Measured)

Both models run locally on the DGX Spark’s Blackwell GPU — Nemotron via NIM, Gemma via llama.cpp.

| Metric | Gemma 4 31B F16 (llama.cpp) | Nemotron 3 Super 120B (NIM) |
|---|---|---|
| Prompt processing | 20-43 tokens/sec (measured) | Not exposed by NIM |
| Generation | 3.3-3.6 tokens/sec (measured) | ~35 tokens/sec (measured) |
| Simple query end-to-end | 2-3 seconds (measured) | <1 second (measured) |
| Full agent response | 3-4 seconds typical (measured) | ~1 second typical (measured) |
| Context window | 65K (configured), max 256K | 131K |
| Memory footprint | ~64 GB (F16 weights + KV cache) | Included in NIM container (~40 GB shared with k3s) |
| Inference backend | llama.cpp (llama-server) | NVIDIA NIM (optimized for Blackwell) |
| Cost | Free (local) | Free (local) |
| Tool calling | Works with requiresStringContent compat | Full native support |
| Thinking/reasoning | Must be disabled at server level | Native support |
| Quantization options | F16, Q8_0, Q4_K_M available | NIM-managed (optimized) |

Measured Timings

Gemma 4 31B (from llama-server logs):

# Simple question ("What is 2+2?")
prompt eval time =  664ms /  27 tokens (24.6 ms/token, 40.6 tok/sec)
       eval time = 1998ms /   8 tokens (249.8 ms/token, 4.0 tok/sec)

# Full agent prompt ("What model are you running?")
prompt eval time =  810ms /  35 tokens (23.2 ms/token, 43.2 tok/sec)
       eval time = 2538ms /   9 tokens (282.1 ms/token, 3.5 tok/sec)

# Agent prompt + response ("Hello")
prompt eval time =  804ms /  24 tokens (33.5 ms/token, 29.8 tok/sec)
       eval time = 2864ms /  10 tokens (286.4 ms/token, 3.5 tok/sec)
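The per-token figures in those logs convert directly between ms/token and tokens/sec:

```python
# ms total and token count -> tokens/sec, cross-checking the log lines above.
def tok_per_sec(ms_total: float, n_tokens: int) -> float:
    return n_tokens / (ms_total / 1000)

print(round(tok_per_sec(1998, 8), 1))   # → 4.0 (matches "4.0 tok/sec")
print(round(tok_per_sec(664, 27), 1))   # ~40.7 (log shows 40.6; rounding)
```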

Nemotron 3 Super 120B (from curl timing inside sandbox):

# Simple question ("What is 2+2? Reply in one sentence.")
# 28 prompt tokens → 33 completion tokens
# Total time: 929ms (end-to-end including prompt processing)
# ~35.5 completion tokens/sec

Note: The full agent prompt is ~8000 tokens (workspace files + system prompt), but most are cached from the previous turn — only new tokens need processing.
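Prompt caching is what keeps agent responses in the 3-4 second range; without it, reprocessing the full prompt would dominate every turn:

```python
# Why prompt caching matters at llama.cpp's measured ~40 tok/s prompt speed.
prompt_tokens = 8000           # full agent prompt
prompt_speed = 40              # tok/s prompt processing (measured above)
cold = prompt_tokens / prompt_speed
warm = 35 / prompt_speed       # ~35 new tokens after a cache hit, per the logs
print(f"cold start: ~{cold:.0f} s of prompt processing")  # ~200 s
print(f"cache hit:  ~{warm:.1f} s")                       # under 1 s
```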

Why Nemotron Is Faster

Nemotron runs through NVIDIA NIM, which is optimized for the Blackwell architecture with TensorRT-LLM, quantization, and batching optimizations. llama.cpp is a general-purpose inference engine without Blackwell-specific optimizations. Despite being a much larger model (120B vs 31B parameters), Nemotron generates tokens ~10x faster thanks to NIM’s optimization stack.

When to Use Each Model

Use Nemotron when:

  • Speed matters — ~10x faster generation
  • Tool calling is needed — full native support
  • Complex multi-turn conversations — larger context window, better agent behavior

Use Gemma 4 when:

  • Experimenting with alternative models
  • Testing Google’s latest open model architecture
  • Privacy-sensitive tasks (no NIM dependency)
  • Teaching and demonstration purposes (CSCI 599)
  • Comparing model behaviors and capabilities

Part 9: After OpenClaw Updates

When upgrading OpenClaw inside the sandbox:

  1. The openclaw.json configs at ~/.nemoclaw/configs/ are on the host — they survive sandbox rebuilds
  2. After upgrade, diff the new default config against your saved versions
  3. Update both openclaw-nemotron.json and openclaw-gemma.json if schema changes occurred
  4. The openshell provider create registration persists — no need to recreate
  5. The llama.cpp build and model files are on the host — they survive everything

Files Reference

| File | Location | Purpose |
|---|---|---|
| llama-server binary | ~/llama.cpp/build/bin/llama-server | Inference server |
| Gemma 4 model | ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf | 62GB GGUF model file |
| Nemotron config | ~/.nemoclaw/configs/openclaw-nemotron.json | OpenClaw config for Nemotron |
| Gemma config | ~/.nemoclaw/configs/openclaw-gemma.json | OpenClaw config for Gemma |
| Switch script | ~/.nemoclaw/configs/switch-model.sh | One-command model switching |
| HuggingFace venv | ~/llama-cpp-venv/ | Python venv for hf CLI |

Credits

Guide by Marco Papa (@marcopapa99 on X), USC Viterbi School of Engineering

gemma4-nemoclaw-dual-model.zip (2.4 KB)


Fantastic write up!

I did not see how you installed Nemotron 3 Super 120B. Can you add a link to the NIM used in the Credits, or a section on how it was installed?

NIM deployment instructions here - nemotron-3-super-120b-a12b Model by NVIDIA | NVIDIA NIM