Overview
This guide documents how to run Google Gemma 4 31B IT alongside NVIDIA Nemotron 3 Super 120B on a DGX Spark, switching between them at runtime via a single shell command. Chief (the OpenClaw agent) runs inside a NemoClaw sandbox and communicates with models through OpenShell’s inference.local proxy.
This setup is notable because:
- No one has publicly documented running dual local models through NemoClaw’s inference routing
- The documented `openshell inference set` provider names (`compatible-endpoint`, `vllm-local`) do not work in openshell 0.0.21 — we found an undocumented `openshell provider create` workaround via a GitHub issue
- Gemma 4’s thinking mode breaks OpenClaw agent responses — the fix required server-side template overrides, not config changes
- The entire model swap (config + routing + restart) is automated in a single script
Architecture
┌─────────────────────────────────────────────────┐
│ DGX Spark (128GB unified memory) │
│ │
│ ┌──────────────────────┐ │
│ │ NemoClaw Sandbox │ │
│ │ (OpenClaw 2026.4.9) │ │
│ │ │ │
│ │ inference.local ─────┼──► OpenShell Proxy │
│ └──────────────────────┘ │ │
│ │ │
│ ┌────────┴────────┐ │
│ │ │ │
│ nvidia-prod llama-cpp │
│ │ │ │
│ NIM (local) localhost:8000│
│ Nemotron 3 llama-server │
│ Super 120B (Gemma 4) │
│ (on Blackwell) │
└─────────────────────────────────────────────────┘
Hardware & Software
| Component | Detail |
|---|---|
| Hardware | DGX Spark, GB10 Grace Blackwell, 128GB LPDDR5X |
| OS | Ubuntu 24.04.4 LTS (aarch64) |
| CUDA | 13.0 (V13.0.88) |
| openshell CLI | 0.0.21 |
| Gateway image | `openshell/cluster` (GitHub Packages) |
| NemoClaw CLI | v0.1.0 |
| OpenClaw | 2026.4.9 |
| llama.cpp | Built from source (sm_121) |
| Model (NIM) | nvidia/nemotron-3-super-120b-a12b via NIM (local inference on Blackwell GPU) |
| Model (llama.cpp) | gemma-4-31B-it-f16.gguf (62GB, F16) via llama-server (local inference on Blackwell GPU) |
Memory Budget
Measured Memory Usage (April 9, 2026)
With NemoClaw infrastructure + llama-server (Gemma 4 31B F16) running simultaneously:
$ free -h
total used free shared buff/cache available
Mem: 121Gi 108Gi 1.0Gi 263Mi 12Gi 13Gi
Swap: 15Gi 8.4Gi 7.6Gi
$ nvidia-smi (GPU processes)
llama-server 64107MiB
Xorg 268MiB
gnome-shell 230MiB
firefox 1047MiB
telegram-desktop 87MiB
Key observations:
- Total system memory: 121GB usable of 128GB (unified CPU/GPU)
- llama-server alone: ~64GB (model weights + KV cache for 65K context)
- NemoClaw/OpenShell containers + OS + desktop: ~44GB
- Free memory with Gemma loaded: ~1GB free, ~13GB available (buff/cache)
- System is actively swapping 8.4GB to disk
- GPU utilization: 6% idle, spikes during inference
Memory Budget Summary
| Component | Approximate Memory |
|---|---|
| NemoClaw/OpenShell + k3s + Docker + NIM (Nemotron) | ~40 GB |
| OS + Desktop (Xorg, GNOME, Firefox, Telegram) | ~4 GB |
| Gemma 4 31B F16 model weights | ~62 GB |
| KV cache (65K context, q8_0) | ~2-3 GB |
| Total when both models active | ~108 GB of 121 GB usable |
⚠️ Memory Warning
Running Gemma 4 31B at F16 precision pushes the DGX Spark close to its limits (89% utilization, active swap). The system works but:
- Stop llama-server when not in use (`Ctrl+C`) to free ~64GB
- Close Firefox and unnecessary desktop apps to recover ~1-2GB
- Monitor swap usage — heavy swapping degrades inference speed
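The swap warning above is easier to act on with a programmatic check. A minimal sketch that parses `/proc/meminfo` (standard Linux kernel field names) into MiB figures; the 4 GB alert threshold is an arbitrary choice for illustration, not from this guide:

```python
def meminfo_mib() -> dict:
    """Parse /proc/meminfo (values are reported in kB) into MiB figures."""
    vals = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            vals[key] = int(rest.split()[0])  # first field is the kB value
    return {
        "avail_mib": vals["MemAvailable"] // 1024,
        "swap_used_mib": (vals["SwapTotal"] - vals["SwapFree"]) // 1024,
    }

stats = meminfo_mib()
print(stats)
if stats["swap_used_mib"] > 4096:  # arbitrary threshold; tune to taste
    print("warning: heavy swapping -- expect degraded inference speed")
```

Run it in a loop (or under `watch`) while llama-server is loading to see where the memory actually goes.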
Lighter Alternative: Q4_K_M Quantization
If you want Gemma 4 to coexist comfortably with the full NemoClaw stack without swapping:
# Download quantized model (~18GB instead of 62GB)
source ~/llama-cpp-venv/bin/activate
hf download ggml-org/gemma-4-31B-it-GGUF \
gemma-4-31B-it-Q4_K_M.gguf \
--local-dir ~/models/gemma-4-31B-it-GGUF
# Start with quantized model
cd ~/llama.cpp/build
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 --ctx-size 65536 --threads 8 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --chat-template-kwargs '{"enable_thinking":false}'
No config or script changes needed — the model ID and OpenClaw config stay the same. Only the --model path changes.
| Model Variant | Size | Memory Used | Swap? | Quality |
|---|---|---|---|---|
| F16 (current) | 62GB | ~64GB | Yes, 8.4GB | Maximum |
| Q4_K_M | ~18GB | ~20GB | No | Good (slight degradation) |
Also Consider: Gemma 4 26B-A4B (MoE)
The 26B MoE variant activates only 3.8B parameters per forward pass:
- Faster inference (45-60 tok/s reported vs ~3.5 tok/s for 31B dense)
- Lower memory (~28GB at F16, ~10GB at Q4)
- Community reports better stability on DGX Spark
- Separate GGUF download required
Part 1: Building llama.cpp
Prerequisites
git --version # 2.43.0+
cmake --version # 3.28+
nvcc --version # CUDA 13.0+
Install HuggingFace CLI
python3 -m venv ~/llama-cpp-venv
source ~/llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
hf version
Clone and Build
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
Build takes ~5-10 minutes. Binary is at ~/llama.cpp/build/bin/llama-server.
Download the Model
source ~/llama-cpp-venv/bin/activate
hf download ggml-org/gemma-4-31B-it-GGUF \
gemma-4-31B-it-f16.gguf \
--local-dir ~/models/gemma-4-31B-it-GGUF
~62GB download. On DGX Spark with fast network, completed in ~2 minutes at 546MB/s. Resumable if interrupted.
Part 2: Starting the llama-server
The Critical Flags
cd ~/llama.cpp/build
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 --ctx-size 65536 --threads 8 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --chat-template-kwargs '{"enable_thinking":false}'
Wait for: main: server is listening on http://0.0.0.0:8000
Parameter Explanations
| Parameter | Value | Why |
|---|---|---|
| `--host 0.0.0.0` | Bind all interfaces | Required — OpenShell’s Docker bridge network needs to reach the server |
| `--port 8000` | Port 8000 | Matches the OpenShell provider config (default vLLM port) |
| `--n-gpu-layers 99` | Offload all layers to GPU | DGX Spark has enough VRAM |
| `--ctx-size 65536` | 64K context window | OpenClaw’s full agent prompt (workspace files + system prompt) uses ~8000 tokens. 8192 is too small — caused context overflow. 65536 provides comfortable headroom |
| `--threads 8` | CPU threads | For non-GPU work |
| `--flash-attn on` | Enable flash attention | Performance optimization (requires the explicit `on` value; the bare flag errors) |
| `--cache-type-k q8_0` | KV cache quantization | Reduces memory for large context windows while maintaining quality |
| `--cache-type-v q8_0` | KV cache quantization | Same as above, for values |
| `--jinja` | Enable Jinja templates | Required for `--chat-template-kwargs` to work |
| `--chat-template-kwargs '{"enable_thinking":false}'` | Disable thinking mode | THE CRITICAL FIX — see below |
Why Thinking Mode Must Be Disabled
Gemma 4 31B IT has thinking/reasoning mode enabled by default in its chat template. When active:
- The model puts all output in the `reasoning_content` field instead of `content`
- The `content` field is returned as an empty string `""`
- OpenClaw reads `content`, gets nothing, and returns “No reply from agent”
- The `think: false` parameter in the API request works (tested via curl), but OpenClaw does not send it
- Server-level `--chat-template-kwargs '{"enable_thinking":false}'` is the only reliable fix
Discovery process:
- Without the flag: `content: ""`, `reasoning_content: "The answer is..."` → OpenClaw sees an empty response
- With `"think": false` in the curl request: `content: "The answer is 4"` → Works, but OpenClaw doesn’t send this
- With `--chat-template-kwargs '{"enable_thinking":false}'`: `content: "The answer is 4"`, `reasoning_content: "NONE"` → Works for all requests
Note from Google docs: Larger Gemma 4 models (26B, 31B) may occasionally emit thought channels even when thinking is disabled. The --jinja template override is more reliable than request-level parameters.
Verify the Server (Outside Sandbox)
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma4","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":100}' \
| python3 -c "import sys,json; d=json.load(sys.stdin); \
print('content:', repr(d['choices'][0]['message'].get('content'))); \
print('reasoning:', repr(d['choices'][0]['message'].get('reasoning_content','NONE')))"
Expected: content: '2 + 2 = 4', reasoning: 'NONE'
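When the curl check returns something unexpected, the failure modes map cleanly onto a few response shapes. A hypothetical helper (`diagnose_reply` is not part of OpenClaw; it just mirrors how OpenClaw reads the `content` field) that classifies a chat-completion response:

```python
def diagnose_reply(response: dict) -> str:
    """Classify a /v1/chat/completions response the way this guide's
    troubleshooting does: empty `content` plus a populated
    `reasoning_content` means thinking mode is still enabled."""
    msg = response["choices"][0]["message"]
    content = msg.get("content") or ""
    reasoning = msg.get("reasoning_content") or ""
    if content:
        return "ok"
    if reasoning:
        return "thinking-mode-enabled"  # restart with enable_thinking:false
    return "empty-reply"

# The two shapes observed in this guide:
broken = {"choices": [{"message": {"content": "",
                                   "reasoning_content": "The answer is..."}}]}
fixed = {"choices": [{"message": {"content": "The answer is 4"}}]}
print(diagnose_reply(broken))  # thinking-mode-enabled
print(diagnose_reply(fixed))   # ok
```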
Part 3: OpenShell Provider Registration
The Problem
The NemoClaw sandbox cannot reach localhost:8000 directly — the sandbox runs in an isolated network namespace. All inference traffic must go through OpenShell’s inference.local proxy.
What Did NOT Work
The documented openshell inference set provider names are not implemented in openshell 0.0.21:
# ALL OF THESE FAIL with "provider not found":
openshell inference set --provider compatible-endpoint --model gemma-4-31B-it
openshell inference set --provider vllm-local --model gemma-4-31B-it
openshell inference set --provider vllm --model gemma-4-31B-it
openshell inference set --provider local --model gemma-4-31B-it
openshell inference set --provider custom --model gemma-4-31B-it
openshell inference set --provider openai-compatible --model gemma-4-31B-it
openshell inference set --provider "Local Inference" --model gemma-4-31B-it
openshell inference set --provider "OpenAI-Compatible" --model gemma-4-31B-it
Bug report filed on NemoClaw GitHub.
What DOES Work: openshell provider create
Found via NemoClaw GitHub issue #893. The key is host.openshell.internal — the DNS name the sandbox uses to reach the host machine.
openshell provider create --name llama-cpp \
--type openai \
--credential "OPENAI_API_KEY=unused" \
--config "OPENAI_BASE_URL=http://host.openshell.internal:8000/v1"
This registers llama-cpp as a named provider. It persists across restarts — only needs to be run once.
Switching Inference
# Switch to Gemma
openshell inference set --provider llama-cpp --model gemma-4-31B-it --no-verify
# Switch to Nemotron
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b --no-verify
Verify from Inside Sandbox
nemoclaw my-assistant connect
curl -s https://inference.local/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma-4-31B-it","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":50}' \
| python3 -c "import sys,json; d=json.load(sys.stdin); print(d['choices'][0]['message'].get('content'))"
exit
Part 4: OpenClaw Configuration (What Changed and What Didn’t)
The Key Constraint
OpenClaw inside the sandbox talks to https://inference.local/v1 — a single endpoint proxied by OpenShell. You cannot have two providers simultaneously in openclaw.json. Switching models requires updating both the OpenShell routing AND the OpenClaw config.
What Changed Between Configs
| Setting | Nemotron Config | Gemma Config |
|---|---|---|
| `models.providers.inference.api` | `openai-responses` | `openai-completions` |
| `models.providers.inference.models[0].id` | `nvidia/nemotron-3-super-120b-a12b` | `gemma-4-31B-it` |
| `models.providers.inference.models[0].name` | `inference/nvidia/nemotron-3-super-120b-a12b` | `inference/gemma-4-31B-it` |
| `models.providers.inference.models[0].contextWindow` | `131072` | `65536` |
| `models.providers.inference.models[0].compat` | (not present) | `{"requiresStringContent": true}` |
| `agents.defaults.model.primary` | `inference/nvidia/nemotron-3-super-120b-a12b` | `inference/gemma-4-31B-it` |
What Did NOT Change
These stay identical across both configs:
- `baseUrl`: `https://inference.local/v1` (always through the OpenShell proxy)
- `apiKey`: `unused`
- `env.TAVILY_API_KEY`
- `plugins.entries.tavily`
- `gateway` settings (token, auth, controlUi)
- `commands` settings
- `channels.defaults`
- All other agent defaults
The compat Flag That Matters
"compat": {
"requiresStringContent": true
}
`requiresStringContent: true` — Required. llama-server’s OpenAI-compatible endpoint expects string content, not structured content-part arrays. Without this, OpenClaw sends `[{"type": "text", "text": "..."}]`, which causes `invalid type: sequence, expected a string` errors.

`supportsTools: false` — Do NOT use. Initially added based on the inferrs documentation, but this flag causes OpenClaw to suppress valid agent responses. Removing it fixed the “No reply from agent” issue for complex questions.
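What `requiresStringContent` effectively buys you can be illustrated with a toy normalizer. This is only an illustration of the two payload shapes involved, not OpenClaw’s actual code:

```python
def flatten_content(content):
    """Collapse OpenAI-style content-part arrays into the plain string
    that llama-server's endpoint expects. A no-op for strings."""
    if isinstance(content, str):
        return content
    # [{"type": "text", "text": "..."}] -> "..."
    return "".join(part.get("text", "")
                   for part in content
                   if part.get("type") == "text")

structured = [{"type": "text", "text": "What is 2+2?"}]
print(flatten_content(structured))      # What is 2+2?
print(flatten_content("plain string"))  # plain string
```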
The Configs
Both configs are stored on the host (outside Docker) at:
- `~/.nemoclaw/configs/openclaw-nemotron.json`
- `~/.nemoclaw/configs/openclaw-gemma.json`
The gateway regenerates the agent-level models.json from openclaw.json on every restart, so manual edits to ~/.openclaw/agents/main/agent/models.json inside the sandbox are wiped. Always edit the master config via docker cp.
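To confirm the intended master config is actually live after a `docker cp`, comparing the handful of fields from the table above is enough. A sketch with hypothetical helpers (`get_path` and `diff_configs` are not real tools; the dotted paths follow the table, so adjust them if your schema differs):

```python
def get_path(cfg: dict, dotted: str):
    """Walk a dotted path like 'models.providers.inference.api',
    treating a trailing [0] as a list index."""
    cur = cfg
    for part in dotted.split("."):
        if part.endswith("[0]"):
            cur = cur[part[:-3]][0]
        else:
            cur = cur[part]
    return cur

# Fields that differ between the two configs, per the table above
FIELDS = [
    "models.providers.inference.api",
    "models.providers.inference.models[0].id",
    "agents.defaults.model.primary",
]

def diff_configs(a: dict, b: dict) -> dict:
    """Return {field: (a_value, b_value)} for every field that differs."""
    return {f: (get_path(a, f), get_path(b, f))
            for f in FIELDS if get_path(a, f) != get_path(b, f)}

# Minimal stand-ins for the two configs (load the real JSON files instead)
nemotron = {"models": {"providers": {"inference": {
                "api": "openai-responses",
                "models": [{"id": "nvidia/nemotron-3-super-120b-a12b"}]}}},
            "agents": {"defaults": {"model": {
                "primary": "inference/nvidia/nemotron-3-super-120b-a12b"}}}}
gemma = {"models": {"providers": {"inference": {
             "api": "openai-completions",
             "models": [{"id": "gemma-4-31B-it"}]}}},
         "agents": {"defaults": {"model": {
             "primary": "inference/gemma-4-31B-it"}}}}

print(sorted(diff_configs(nemotron, gemma)))
```

An empty diff against the expected config means the switch took effect; a non-empty diff means the gateway is still serving the old one.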
Part 5: The Switch Script
Location
~/.nemoclaw/configs/switch-model.sh
Script
#!/bin/bash
set -euo pipefail
CONFIGS_DIR="$HOME/.nemoclaw/configs"
CONTAINER="openshell-cluster-nemoclaw"
if [[ "${1:-}" == "gemma" ]]; then
CONFIG="$CONFIGS_DIR/openclaw-gemma.json"
PROVIDER="llama-cpp"
MODEL="gemma-4-31B-it"
echo "Switching to Gemma 4 31B..."
elif [[ "${1:-}" == "nemotron" ]]; then
CONFIG="$CONFIGS_DIR/openclaw-nemotron.json"
PROVIDER="nvidia-prod"
MODEL="nvidia/nemotron-3-super-120b-a12b"
echo "Switching to Nemotron 3 Super 120B..."
else
echo "Usage: bash switch-model.sh [gemma|nemotron]"
exit 1
fi
# Switch inference provider
openshell inference set --provider "$PROVIDER" --model "$MODEL" --no-verify
# Update openclaw.json inside sandbox
docker exec -u root "$CONTAINER" sh -c 'ROOTFS=$(find /run/k3s/containerd -name "openclaw.json" -path "*/rootfs/*" 2>/dev/null | head -1) && mv $ROOTFS ${ROOTFS}.old'
docker cp "$CONFIG" "$CONTAINER":/tmp/openclaw-new.json
docker exec -u root "$CONTAINER" sh -c 'ROOTFS=$(find /run/k3s/containerd -name "openclaw.json.old" -path "*/rootfs/*" 2>/dev/null | head -1 | sed "s|.old||") && cp /tmp/openclaw-new.json $ROOTFS'
# Restart services
nemoclaw stop
nemoclaw start
echo "Done. Use /new on Telegram to start a fresh session."
Usage
# Prerequisites: llama-server must be running for Gemma
bash ~/.nemoclaw/configs/switch-model.sh gemma
# Switch back to Nemotron via NIM (no llama-server needed)
bash ~/.nemoclaw/configs/switch-model.sh nemotron
After switching, always use /new on Telegram to start a fresh session.
Part 6: Complete Workflow
Starting Gemma 4 from Scratch
# 1. SSH to DGX Spark
ssh marcopapa@spark-dcce.taila3bbce.ts.net
# 2. Start llama-server (in one terminal, keep open)
cd ~/llama.cpp/build
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 --ctx-size 65536 --threads 8 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --chat-template-kwargs '{"enable_thinking":false}'
# 3. In a second terminal, switch to Gemma
bash ~/.nemoclaw/configs/switch-model.sh gemma
# 4. On Telegram: /new, then chat with Gemma 4
Switching Back to Nemotron
# 1. Switch inference routing + config
bash ~/.nemoclaw/configs/switch-model.sh nemotron
# 2. (Optional) Stop llama-server to free GPU memory
# Ctrl+C in the llama-server terminal
# 3. On Telegram: /new, then chat with Nemotron
Part 7: Troubleshooting
“No reply from agent”
| Cause | Fix |
|---|---|
| Thinking mode enabled | Restart llama-server with --jinja --chat-template-kwargs '{"enable_thinking":false}' |
| `supportsTools: false` in compat | Remove it — keep only `requiresStringContent: true` |
| Context overflow (ctx-size too small) | Increase --ctx-size (8192 is too small, 65536 works) |
| openclaw.json not updated | Run switch-model.sh or verify with docker exec cat |
| Stale Telegram session | Use /new to start fresh |
“Context overflow: prompt too large”
The full OpenClaw agent prompt (SOUL.md, USER.md, AGENTS.md, IDENTITY.md, HEARTBEAT.md, TOOLS.md + system prompt) uses ~8000 tokens. The --ctx-size must be significantly larger than this to leave room for conversation.
| ctx-size | Status |
|---|---|
| 8192 | ❌ Too small — context overflow |
| 32768 | ⚠️ Minimum viable |
| 65536 | ✅ Recommended |
| 131072 | ✅ Maximum headroom (more memory) |
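A back-of-the-envelope way to pick a `ctx-size`: estimate the prompt’s token count (roughly 4 characters per token for English prose; llama-server’s `/tokenize` endpoint gives exact counts) and subtract a reply budget. The helper names and the 2048-token reply budget here are illustrative choices, not from this guide:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (~4 chars/token for English prose)."""
    return int(len(text) / chars_per_token)

def headroom(prompt_tokens: int, ctx_size: int, reply_budget: int = 2048) -> int:
    """Tokens left for conversation after the prompt and one reply."""
    return ctx_size - prompt_tokens - reply_budget

# The ~8000-token agent prompt against the ctx-size options above:
for ctx in (8192, 32768, 65536):
    print(ctx, headroom(8000, ctx))
```

Negative headroom at 8192 is exactly the context-overflow failure described above.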
Server Shows 200 But No Reply
The server processed the request successfully but OpenClaw dropped the response. Check:
- Is `content` empty in the response? (thinking mode issue)
- Is `compat.supportsTools` set to `false`? (remove it)
- Does the model ID in openclaw.json match what the server expects?
Provider Not Found
openshell inference set --provider <name> only works with registered providers. In openshell 0.0.21, use openshell provider create to register custom providers first.
Part 8: Performance Notes (Measured)
Both models run locally on the DGX Spark’s Blackwell GPU — Nemotron via NIM, Gemma via llama.cpp.
| Metric | Gemma 4 31B F16 (llama.cpp) | Nemotron 3 Super 120B (NIM) |
|---|---|---|
| Prompt processing | 20-43 tokens/sec (measured) | Not exposed by NIM |
| Generation | 3.3-3.6 tokens/sec (measured) | ~35 tokens/sec (measured) |
| Simple query end-to-end | 2-3 seconds (measured) | <1 second (measured) |
| Full agent response | 3-4 seconds typical (measured) | ~1 second typical (measured) |
| Context window | 65K (configured), max 256K | 131K |
| Memory footprint | ~64 GB (F16 weights + KV cache) | Included in NIM container (~40 GB shared with k3s) |
| Inference backend | llama.cpp (llama-server) | NVIDIA NIM (optimized for Blackwell) |
| Cost | Free (local) | Free (local) |
| Tool calling | Works with `requiresStringContent` compat | Full native support |
| Thinking/reasoning | Must be disabled at server level | Native support |
| Quantization options | F16, Q8_0, Q4_K_M available | NIM-managed (optimized) |
Measured Timings
Gemma 4 31B (from llama-server logs):
# Simple question ("What is 2+2?")
prompt eval time = 664ms / 27 tokens (24.6 ms/token, 40.6 tok/sec)
eval time = 1998ms / 8 tokens (249.8 ms/token, 4.0 tok/sec)
# Full agent prompt ("What model are you running?")
prompt eval time = 810ms / 35 tokens (23.2 ms/token, 43.2 tok/sec)
eval time = 2538ms / 9 tokens (282.1 ms/token, 3.5 tok/sec)
# Agent prompt + response ("Hello")
prompt eval time = 804ms / 24 tokens (33.5 ms/token, 29.8 tok/sec)
eval time = 2864ms / 10 tokens (286.4 ms/token, 3.5 tok/sec)
Nemotron 3 Super 120B (from curl timing inside sandbox):
# Simple question ("What is 2+2? Reply in one sentence.")
# 28 prompt tokens → 33 completion tokens
# Total time: 929ms (end-to-end including prompt processing)
# ~35.5 completion tokens/sec
Note: The full agent prompt is ~8000 tokens (workspace files + system prompt), but most are cached from the previous turn — only new tokens need processing.
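The per-request throughput figures above can be pulled out of the server log mechanically. A parser sketch that matches the timing-line format shown in this guide (real llama-server logs may differ slightly in spacing, so treat the regex as an assumption):

```python
import re

# Matches timing lines like:
#   prompt eval time = 664ms / 27 tokens (24.6 ms/token, 40.6 tok/sec)
TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+) tokens")

def tokens_per_sec(line: str):
    """Return (phase, tok/s) for a timing line, or None if it doesn't match."""
    m = TIMING.search(line)
    if not m:
        return None
    phase, ms, tokens = m.group(1), float(m.group(2)), int(m.group(3))
    return phase, round(tokens / (ms / 1000.0), 1)

print(tokens_per_sec("prompt eval time = 664ms / 27 tokens"))
print(tokens_per_sec("eval time = 1998ms / 8 tokens"))
```

Piping the llama-server log through this while chatting makes regressions (e.g. from swapping) easy to spot.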
Why Nemotron Is Faster
Nemotron runs through NVIDIA NIM, which is optimized for the Blackwell architecture with TensorRT-LLM, quantization, and batching optimizations. llama.cpp is a general-purpose inference engine without Blackwell-specific optimizations. Despite being a much larger model (120B vs 31B parameters), Nemotron generates tokens ~10x faster thanks to NIM’s optimization stack.
When to Use Each Model
Use Nemotron when:
- Speed matters — ~10x faster generation
- Tool calling is needed — full native support
- Complex multi-turn conversations — larger context window, better agent behavior
Use Gemma 4 when:
- Experimenting with alternative models
- Testing Google’s latest open model architecture
- Privacy-sensitive tasks (no NIM dependency)
- Teaching and demonstration purposes (CSCI 599)
- Comparing model behaviors and capabilities
Part 9: After OpenClaw Updates
When upgrading OpenClaw inside the sandbox:
- The `openclaw.json` configs at `~/.nemoclaw/configs/` are on the host — they survive sandbox rebuilds
- After an upgrade, diff the new default config against your saved versions
- Update both `openclaw-nemotron.json` and `openclaw-gemma.json` if schema changes occurred
- The `openshell provider create` registration persists — no need to recreate it
- The llama.cpp build and model files are on the host — they survive everything
Files Reference
| File | Location | Purpose |
|---|---|---|
| llama-server binary | `~/llama.cpp/build/bin/llama-server` | Inference server |
| Gemma 4 model | `~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf` | 62GB GGUF model file |
| Nemotron config | `~/.nemoclaw/configs/openclaw-nemotron.json` | OpenClaw config for Nemotron |
| Gemma config | `~/.nemoclaw/configs/openclaw-gemma.json` | OpenClaw config for Gemma |
| Switch script | `~/.nemoclaw/configs/switch-model.sh` | One-command model switching |
| HuggingFace venv | `~/llama-cpp-venv/` | Python venv for the `hf` CLI |
Credits
- NVIDIA DGX Spark Playbooks: Run models with llama.cpp on DGX Spark | DGX Spark
- NVIDIA NemoClaw docs: https://docs.nvidia.com/nemoclaw/latest/
- OpenClaw inferrs provider docs: inferrs - OpenClaw
- llama.cpp Gemma 4 discussion: Can't disable thinking in gemma4 (26b-a4b) · ggml-org/llama.cpp · Discussion #21338 · GitHub
- Google Gemma 4 thinking docs: https://ai.google.dev/gemma/docs/capabilities/thinking
- NemoClaw issue #893 (openshell provider create): Connecting Nemoclaw to self-hosted vLLM model on same host · Issue #893 · NVIDIA/NemoClaw · GitHub
Guide by Marco Papa (@marcopapa99 on X), USC Viterbi School of Engineering