Overview
This guide documents how to run Google Gemma 4 31B IT alongside NVIDIA Nemotron 3 Super 120B on a DGX Spark, switching between them at runtime via a single shell command. Chief (the OpenClaw agent) runs inside a NemoClaw sandbox and communicates with models through OpenShell’s inference.local proxy.
This setup is notable because:
- No one has publicly documented running dual local models through NemoClaw’s inference routing
- The documented `openshell inference set` provider names (`compatible-endpoint`, `vllm-local`) do not work in openshell 0.0.21 — we found an undocumented `openshell provider create` workaround via a GitHub issue
- Gemma 4’s thinking mode breaks OpenClaw agent responses — the fix required server-side template overrides, not config changes
- The entire model swap (config + routing + restart) is automated in a single script
Architecture
┌─────────────────────────────────────────────────┐
│ DGX Spark (128GB unified memory) │
│ │
│ ┌──────────────────────┐ │
│ │ NemoClaw Sandbox │ │
│ │ (OpenClaw 2026.4.9) │ │
│ │ │ │
│ │ inference.local ─────┼──► OpenShell Proxy │
│ └──────────────────────┘ │ │
│ │ │
│ ┌────────┴────────┐ │
│ │ │ │
│ nvidia-prod llama-cpp │
│ │ │ │
│ NIM (local) localhost:8000│
│ Nemotron 3 llama-server │
│ Super 120B (Gemma 4) │
│ (on Blackwell) │
└─────────────────────────────────────────────────┘
Hardware & Software
| Component | Detail |
|---|---|
| Hardware | DGX Spark, GB10 Grace Blackwell, 128GB LPDDR5X |
| OS | Ubuntu 24.04.4 LTS (aarch64) |
| CUDA | 13.0 (V13.0.88) |
| openshell CLI | 0.0.21 |
| Gateway image | `openshell/cluster` (GitHub Packages) |
| NemoClaw CLI | v0.1.0 |
| OpenClaw | 2026.4.9 |
| llama.cpp | Built from source (sm_121) |
| Model (NIM) | nvidia/nemotron-3-super-120b-a12b via NIM (local inference on Blackwell GPU) |
| Model (llama.cpp) | gemma-4-31B-it-f16.gguf (62GB, F16) via llama-server (local inference on Blackwell GPU) |
Memory Budget
Measured Memory Usage (April 9, 2026)
With NemoClaw infrastructure + llama-server (Gemma 4 31B F16) running simultaneously:
$ free -h
total used free shared buff/cache available
Mem: 121Gi 108Gi 1.0Gi 263Mi 12Gi 13Gi
Swap: 15Gi 8.4Gi 7.6Gi
$ nvidia-smi (GPU processes)
llama-server 64107MiB
Xorg 268MiB
gnome-shell 230MiB
firefox 1047MiB
telegram-desktop 87MiB
Key observations:
- Total system memory: 121GB usable of 128GB (unified CPU/GPU)
- llama-server alone: ~64GB (model weights + KV cache for 65K context)
- NemoClaw/OpenShell containers + OS + desktop: ~44GB
- Free memory with Gemma loaded: ~1GB free, ~13GB available (buff/cache)
- System is actively swapping 8.4GB to disk
- GPU utilization: 6% idle, spikes during inference
Memory Budget Summary
| Component | Approximate Memory |
|---|---|
| NemoClaw/OpenShell + k3s + Docker + NIM (Nemotron) | ~40 GB |
| OS + Desktop (Xorg, GNOME, Firefox, Telegram) | ~4 GB |
| Gemma 4 31B F16 model weights | ~62 GB |
| KV cache (65K context, q8_0) | ~2-3 GB |
| Total when both models active | ~108 GB of 121 GB usable |
⚠️ Memory Warning
Running Gemma 4 31B at F16 precision pushes the DGX Spark close to its limits (89% utilization, active swap). The system works but:
- Stop llama-server when not in use (`Ctrl+C`) to free ~64GB
- Close Firefox and unnecessary desktop apps to recover ~1-2GB
- Monitor swap usage — heavy swapping degrades inference speed
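The swap warning above is easier to act on with a programmatic check. A minimal sketch that parses `/proc/meminfo` (standard Linux kernel field names) into MiB figures; the 4 GB alert threshold is an arbitrary choice for illustration, not from this guide:

```python
def meminfo_mib() -> dict:
    """Parse /proc/meminfo (values are reported in kB) into MiB figures."""
    vals = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            vals[key] = int(rest.split()[0])  # first field is the kB value
    return {
        "avail_mib": vals["MemAvailable"] // 1024,
        "swap_used_mib": (vals["SwapTotal"] - vals["SwapFree"]) // 1024,
    }

stats = meminfo_mib()
print(stats)
if stats["swap_used_mib"] > 4096:  # arbitrary threshold; tune to taste
    print("warning: heavy swapping -- expect degraded inference speed")
```

Run it in a loop (or under `watch`) while llama-server is loading to see where the memory actually goes.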
Lighter Alternative: Q4_K_M Quantization
If you want Gemma 4 to coexist comfortably with the full NemoClaw stack without swapping:
# Download quantized model (~18GB instead of 62GB)
source ~/llama-cpp-venv/bin/activate
hf download ggml-org/gemma-4-31B-it-GGUF \
gemma-4-31B-it-Q4_K_M.gguf \
--local-dir ~/models/gemma-4-31B-it-GGUF
# Start with quantized model
cd ~/llama.cpp/build
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 --ctx-size 65536 --threads 8 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --chat-template-kwargs '{"enable_thinking":false}'
No config or script changes needed — the model ID and OpenClaw config stay the same. Only the --model path changes.
| Model Variant | Size | Memory Used | Swap? | Quality |
|---|---|---|---|---|
| F16 (current) | 62GB | ~64GB | Yes, 8.4GB | Maximum |
| Q4_K_M | ~18GB | ~20GB | No | Good (slight degradation) |
Also Consider: Gemma 4 26B-A4B (MoE)
The 26B MoE variant activates only 3.8B parameters per forward pass:
- Faster inference (45-60 tok/s reported vs ~3.5 tok/s for 31B dense)
- Lower memory (~28GB at F16, ~10GB at Q4)
- Community reports better stability on DGX Spark
- Separate GGUF download required
Part 1: Building llama.cpp
Prerequisites
git --version # 2.43.0+
cmake --version # 3.28+
nvcc --version # CUDA 13.0+
Install HuggingFace CLI
python3 -m venv ~/llama-cpp-venv
source ~/llama-cpp-venv/bin/activate
pip install -U "huggingface_hub[cli]"
hf version
Clone and Build
git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121" -DLLAMA_CURL=OFF
make -j8
Build takes ~5-10 minutes. Binary is at ~/llama.cpp/build/bin/llama-server.
Download the Model
source ~/llama-cpp-venv/bin/activate
hf download ggml-org/gemma-4-31B-it-GGUF \
gemma-4-31B-it-f16.gguf \
--local-dir ~/models/gemma-4-31B-it-GGUF
~62GB download. On DGX Spark with fast network, completed in ~2 minutes at 546MB/s. Resumable if interrupted.
Part 2: Starting the llama-server
The Critical Flags
cd ~/llama.cpp/build
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 --ctx-size 65536 --threads 8 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --chat-template-kwargs '{"enable_thinking":false}'
Wait for: main: server is listening on http://0.0.0.0:8000
Parameter Explanations
| Parameter | Value | Why |
|---|---|---|
| `--host 0.0.0.0` | Bind all interfaces | Required — OpenShell’s Docker bridge network needs to reach the server |
| `--port 8000` | Port 8000 | Matches the OpenShell provider config (default vLLM port) |
| `--n-gpu-layers 99` | Offload all layers to GPU | DGX Spark has enough VRAM |
| `--ctx-size 65536` | 64K context window | OpenClaw’s full agent prompt (workspace files + system prompt) uses ~8000 tokens. 8192 is too small — caused context overflow. 65536 provides comfortable headroom |
| `--threads 8` | CPU threads | For non-GPU work |
| `--flash-attn on` | Enable flash attention | Performance optimization (requires the explicit `on` value; the bare flag errors) |
| `--cache-type-k q8_0` | KV cache quantization | Reduces memory for large context windows while maintaining quality |
| `--cache-type-v q8_0` | KV cache quantization | Same as above, for values |
| `--jinja` | Enable Jinja templates | Required for `--chat-template-kwargs` to work |
| `--chat-template-kwargs '{"enable_thinking":false}'` | Disable thinking mode | THE CRITICAL FIX — see below |
Why Thinking Mode Must Be Disabled
Gemma 4 31B IT has thinking/reasoning mode enabled by default in its chat template. When active:
- The model puts all output in the `reasoning_content` field instead of `content`
- The `content` field is returned as an empty string `""`
- OpenClaw reads `content`, gets nothing, and returns “No reply from agent”
- The `think: false` parameter in the API request works (tested via curl), but OpenClaw does not send it
- Server-level `--chat-template-kwargs '{"enable_thinking":false}'` is the only reliable fix
Discovery process:
- Without the flag: `content: ""`, `reasoning_content: "The answer is..."` → OpenClaw sees an empty response
- With `"think": false` in the curl request: `content: "The answer is 4"` → Works, but OpenClaw doesn’t send this
- With `--chat-template-kwargs '{"enable_thinking":false}'`: `content: "The answer is 4"`, `reasoning_content: "NONE"` → Works for all requests
Note from Google docs: Larger Gemma 4 models (26B, 31B) may occasionally emit thought channels even when thinking is disabled. The --jinja template override is more reliable than request-level parameters.
Verify the Server (Outside Sandbox)
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma4","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":100}' \
| python3 -c "import sys,json; d=json.load(sys.stdin); \
print('content:', repr(d['choices'][0]['message'].get('content'))); \
print('reasoning:', repr(d['choices'][0]['message'].get('reasoning_content','NONE')))"
Expected: content: '2 + 2 = 4', reasoning: 'NONE'
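When the curl check returns something unexpected, the failure modes map cleanly onto a few response shapes. A hypothetical helper (`diagnose_reply` is not part of OpenClaw; it just mirrors how OpenClaw reads the `content` field) that classifies a chat-completion response:

```python
def diagnose_reply(response: dict) -> str:
    """Classify a /v1/chat/completions response the way this guide's
    troubleshooting does: empty `content` plus a populated
    `reasoning_content` means thinking mode is still enabled."""
    msg = response["choices"][0]["message"]
    content = msg.get("content") or ""
    reasoning = msg.get("reasoning_content") or ""
    if content:
        return "ok"
    if reasoning:
        return "thinking-mode-enabled"  # restart with enable_thinking:false
    return "empty-reply"

# The two shapes observed in this guide:
broken = {"choices": [{"message": {"content": "",
                                   "reasoning_content": "The answer is..."}}]}
fixed = {"choices": [{"message": {"content": "The answer is 4"}}]}
print(diagnose_reply(broken))  # thinking-mode-enabled
print(diagnose_reply(fixed))   # ok
```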
Part 3: OpenShell Provider Registration
The Problem
The NemoClaw sandbox cannot reach localhost:8000 directly — the sandbox runs in an isolated network namespace. All inference traffic must go through OpenShell’s inference.local proxy.
What Did NOT Work
The documented openshell inference set provider names are not implemented in openshell 0.0.21:
# ALL OF THESE FAIL with "provider not found":
openshell inference set --provider compatible-endpoint --model gemma-4-31B-it
openshell inference set --provider vllm-local --model gemma-4-31B-it
openshell inference set --provider vllm --model gemma-4-31B-it
openshell inference set --provider local --model gemma-4-31B-it
openshell inference set --provider custom --model gemma-4-31B-it
openshell inference set --provider openai-compatible --model gemma-4-31B-it
openshell inference set --provider "Local Inference" --model gemma-4-31B-it
openshell inference set --provider "OpenAI-Compatible" --model gemma-4-31B-it
Bug report filed on NemoClaw GitHub.
What DOES Work: openshell provider create
Found via NemoClaw GitHub issue #893. The key is host.openshell.internal — the DNS name the sandbox uses to reach the host machine.
openshell provider create --name llama-cpp \
--type openai \
--credential "OPENAI_API_KEY=unused" \
--config "OPENAI_BASE_URL=http://host.openshell.internal:8000/v1"
This registers llama-cpp as a named provider. It persists across restarts — only needs to be run once.
Switching Inference
# Switch to Gemma
openshell inference set --provider llama-cpp --model gemma-4-31B-it --no-verify
# Switch to Nemotron
openshell inference set --provider nvidia-prod --model nvidia/nemotron-3-super-120b-a12b --no-verify
Verify from Inside Sandbox
nemoclaw my-assistant connect
curl -s https://inference.local/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma-4-31B-it","messages":[{"role":"user","content":"What is 2+2?"}],"max_tokens":50}' \
| python3 -c "import sys,json; d=json.load(sys.stdin); print(d['choices'][0]['message'].get('content'))"
exit
Part 4: OpenClaw Configuration (What Changed and What Didn’t)
The Key Constraint
OpenClaw inside the sandbox talks to https://inference.local/v1 — a single endpoint proxied by OpenShell. You cannot have two providers simultaneously in openclaw.json. Switching models requires updating both the OpenShell routing AND the OpenClaw config.
What Changed Between Configs
| Setting | Nemotron Config | Gemma Config |
|---|---|---|
| `models.providers.inference.api` | `openai-responses` | `openai-completions` |
| `models.providers.inference.models[0].id` | `nvidia/nemotron-3-super-120b-a12b` | `gemma-4-31B-it` |
| `models.providers.inference.models[0].name` | `inference/nvidia/nemotron-3-super-120b-a12b` | `inference/gemma-4-31B-it` |
| `models.providers.inference.models[0].contextWindow` | `131072` | `65536` |
| `models.providers.inference.models[0].compat` | (not present) | `{"requiresStringContent": true}` |
| `agents.defaults.model.primary` | `inference/nvidia/nemotron-3-super-120b-a12b` | `inference/gemma-4-31B-it` |
What Did NOT Change
These stay identical across both configs:
- `baseUrl`: `https://inference.local/v1` (always through the OpenShell proxy)
- `apiKey`: `unused`
- `env.TAVILY_API_KEY`
- `plugins.entries.tavily`
- `gateway` settings (token, auth, controlUi)
- `commands` settings
- `channels.defaults`
- All other agent defaults
The compat Flag That Matters
"compat": {
"requiresStringContent": true
}
`requiresStringContent: true` — Required. llama-server’s OpenAI-compatible endpoint expects string content, not structured content-part arrays. Without this, OpenClaw sends `[{"type": "text", "text": "..."}]`, which causes `invalid type: sequence, expected a string` errors.

`supportsTools: false` — Do NOT use. Initially added based on the inferrs documentation, but this flag causes OpenClaw to suppress valid agent responses. Removing it fixed the “No reply from agent” issue for complex questions.
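What `requiresStringContent` effectively buys you can be illustrated with a toy normalizer. This is only an illustration of the two payload shapes involved, not OpenClaw’s actual code:

```python
def flatten_content(content):
    """Collapse OpenAI-style content-part arrays into the plain string
    that llama-server's endpoint expects. A no-op for strings."""
    if isinstance(content, str):
        return content
    # [{"type": "text", "text": "..."}] -> "..."
    return "".join(part.get("text", "")
                   for part in content
                   if part.get("type") == "text")

structured = [{"type": "text", "text": "What is 2+2?"}]
print(flatten_content(structured))      # What is 2+2?
print(flatten_content("plain string"))  # plain string
```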
The Configs
Both configs are stored on the host (outside Docker) at:
- `~/.nemoclaw/configs/openclaw-nemotron.json`
- `~/.nemoclaw/configs/openclaw-gemma.json`
The gateway regenerates the agent-level models.json from openclaw.json on every restart, so manual edits to ~/.openclaw/agents/main/agent/models.json inside the sandbox are wiped. Always edit the master config via docker cp.
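To confirm the intended master config is actually live after a `docker cp`, comparing the handful of fields from the table above is enough. A sketch with hypothetical helpers (`get_path` and `diff_configs` are not real tools; the dotted paths follow the table, so adjust them if your schema differs):

```python
def get_path(cfg: dict, dotted: str):
    """Walk a dotted path like 'models.providers.inference.api',
    treating a trailing [0] as a list index."""
    cur = cfg
    for part in dotted.split("."):
        if part.endswith("[0]"):
            cur = cur[part[:-3]][0]
        else:
            cur = cur[part]
    return cur

# Fields that differ between the two configs, per the table above
FIELDS = [
    "models.providers.inference.api",
    "models.providers.inference.models[0].id",
    "agents.defaults.model.primary",
]

def diff_configs(a: dict, b: dict) -> dict:
    """Return {field: (a_value, b_value)} for every field that differs."""
    return {f: (get_path(a, f), get_path(b, f))
            for f in FIELDS if get_path(a, f) != get_path(b, f)}

# Minimal stand-ins for the two configs (load the real JSON files instead)
nemotron = {"models": {"providers": {"inference": {
                "api": "openai-responses",
                "models": [{"id": "nvidia/nemotron-3-super-120b-a12b"}]}}},
            "agents": {"defaults": {"model": {
                "primary": "inference/nvidia/nemotron-3-super-120b-a12b"}}}}
gemma = {"models": {"providers": {"inference": {
             "api": "openai-completions",
             "models": [{"id": "gemma-4-31B-it"}]}}},
         "agents": {"defaults": {"model": {
             "primary": "inference/gemma-4-31B-it"}}}}

print(sorted(diff_configs(nemotron, gemma)))
```

An empty diff against the expected config means the switch took effect; a non-empty diff means the gateway is still serving the old one.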
Part 5: The Switch Script
Location
~/.nemoclaw/configs/switch-model.sh
Script
#!/bin/bash
set -euo pipefail
CONFIGS_DIR="$HOME/.nemoclaw/configs"
CONTAINER="openshell-cluster-nemoclaw"
if [[ "${1:-}" == "gemma" ]]; then
CONFIG="$CONFIGS_DIR/openclaw-gemma.json"
PROVIDER="llama-cpp"
MODEL="gemma-4-31B-it"
echo "Switching to Gemma 4 31B..."
elif [[ "${1:-}" == "nemotron" ]]; then
CONFIG="$CONFIGS_DIR/openclaw-nemotron.json"
PROVIDER="nvidia-prod"
MODEL="nvidia/nemotron-3-super-120b-a12b"
echo "Switching to Nemotron 3 Super 120B..."
else
echo "Usage: bash switch-model.sh [gemma|nemotron]"
exit 1
fi
# Switch inference provider
openshell inference set --provider "$PROVIDER" --model "$MODEL" --no-verify
# Update openclaw.json inside sandbox
docker exec -u root "$CONTAINER" sh -c 'ROOTFS=$(find /run/k3s/containerd -name "openclaw.json" -path "*/rootfs/*" 2>/dev/null | head -1) && mv $ROOTFS ${ROOTFS}.old'
docker cp "$CONFIG" "$CONTAINER":/tmp/openclaw-new.json
docker exec -u root "$CONTAINER" sh -c 'ROOTFS=$(find /run/k3s/containerd -name "openclaw.json.old" -path "*/rootfs/*" 2>/dev/null | head -1 | sed "s|.old||") && cp /tmp/openclaw-new.json $ROOTFS'
# Restart services
nemoclaw stop
nemoclaw start
echo "Done. Use /new on Telegram to start a fresh session."
Usage
# Prerequisites: llama-server must be running for Gemma
bash ~/.nemoclaw/configs/switch-model.sh gemma
# Switch back to Nemotron via NIM (no llama-server needed)
bash ~/.nemoclaw/configs/switch-model.sh nemotron
After switching, always use /new on Telegram to start a fresh session.
Part 6: Complete Workflow
Starting Gemma 4 from Scratch
# 1. SSH to DGX Spark
ssh marcopapa@spark-dcce.taila3bbce.ts.net
# 2. Start llama-server (in one terminal, keep open)
cd ~/llama.cpp/build
./bin/llama-server \
--model ~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf \
--host 0.0.0.0 --port 8000 \
--n-gpu-layers 99 --ctx-size 65536 --threads 8 \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--jinja --chat-template-kwargs '{"enable_thinking":false}'
# 3. In a second terminal, switch to Gemma
bash ~/.nemoclaw/configs/switch-model.sh gemma
# 4. On Telegram: /new, then chat with Gemma 4
Switching Back to Nemotron
# 1. Switch inference routing + config
bash ~/.nemoclaw/configs/switch-model.sh nemotron
# 2. (Optional) Stop llama-server to free GPU memory
# Ctrl+C in the llama-server terminal
# 3. On Telegram: /new, then chat with Nemotron
Part 7: Troubleshooting
“No reply from agent”
| Cause | Fix |
|---|---|
| Thinking mode enabled | Restart llama-server with --jinja --chat-template-kwargs '{"enable_thinking":false}' |
| `supportsTools: false` in compat | Remove it — keep only `requiresStringContent: true` |
| Context overflow (ctx-size too small) | Increase --ctx-size (8192 is too small, 65536 works) |
| openclaw.json not updated | Run switch-model.sh or verify with docker exec cat |
| Stale Telegram session | Use /new to start fresh |
“Context overflow: prompt too large”
The full OpenClaw agent prompt (SOUL.md, USER.md, AGENTS.md, IDENTITY.md, HEARTBEAT.md, TOOLS.md + system prompt) uses ~8000 tokens. The --ctx-size must be significantly larger than this to leave room for conversation.
| ctx-size | Status |
|---|---|
| 8192 | ❌ Too small — context overflow |
| 32768 | ⚠️ Minimum viable |
| 65536 | ✅ Recommended |
| 131072 | ✅ Maximum headroom (more memory) |
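A back-of-the-envelope way to pick a `ctx-size`: estimate the prompt’s token count (roughly 4 characters per token for English prose; llama-server’s `/tokenize` endpoint gives exact counts) and subtract a reply budget. The helper names and the 2048-token reply budget here are illustrative choices, not from this guide:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (~4 chars/token for English prose)."""
    return int(len(text) / chars_per_token)

def headroom(prompt_tokens: int, ctx_size: int, reply_budget: int = 2048) -> int:
    """Tokens left for conversation after the prompt and one reply."""
    return ctx_size - prompt_tokens - reply_budget

# The ~8000-token agent prompt against the ctx-size options above:
for ctx in (8192, 32768, 65536):
    print(ctx, headroom(8000, ctx))
```

Negative headroom at 8192 is exactly the context-overflow failure described above.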
Server Shows 200 But No Reply
The server processed the request successfully but OpenClaw dropped the response. Check:
- Is `content` empty in the response? (thinking mode issue)
- Is `compat.supportsTools` set to `false`? (remove it)
- Does the model ID in openclaw.json match what the server expects?
Provider Not Found
openshell inference set --provider <name> only works with registered providers. In openshell 0.0.21, use openshell provider create to register custom providers first.
Part 8: Performance Notes (Measured)
Both models run locally on the DGX Spark’s Blackwell GPU — Nemotron via NIM, Gemma via llama.cpp.
| Metric | Gemma 4 31B F16 (llama.cpp) | Nemotron 3 Super 120B (NIM) |
|---|---|---|
| Prompt processing | 20-43 tokens/sec (measured) | Not exposed by NIM |
| Generation | 3.3-3.6 tokens/sec (measured) | ~35 tokens/sec (measured) |
| Simple query end-to-end | 2-3 seconds (measured) | <1 second (measured) |
| Full agent response | 3-4 seconds typical (measured) | ~1 second typical (measured) |
| Context window | 65K (configured), max 256K | 131K |
| Memory footprint | ~64 GB (F16 weights + KV cache) | Included in NIM container (~40 GB shared with k3s) |
| Inference backend | llama.cpp (llama-server) | NVIDIA NIM (optimized for Blackwell) |
| Cost | Free (local) | Free (local) |
| Tool calling | Works with `requiresStringContent` compat | Full native support |
| Thinking/reasoning | Must be disabled at server level | Native support |
| Quantization options | F16, Q8_0, Q4_K_M available | NIM-managed (optimized) |
Measured Timings
Gemma 4 31B (from llama-server logs):
# Simple question ("What is 2+2?")
prompt eval time = 664ms / 27 tokens (24.6 ms/token, 40.6 tok/sec)
eval time = 1998ms / 8 tokens (249.8 ms/token, 4.0 tok/sec)
# Full agent prompt ("What model are you running?")
prompt eval time = 810ms / 35 tokens (23.2 ms/token, 43.2 tok/sec)
eval time = 2538ms / 9 tokens (282.1 ms/token, 3.5 tok/sec)
# Agent prompt + response ("Hello")
prompt eval time = 804ms / 24 tokens (33.5 ms/token, 29.8 tok/sec)
eval time = 2864ms / 10 tokens (286.4 ms/token, 3.5 tok/sec)
Nemotron 3 Super 120B (from curl timing inside sandbox):
# Simple question ("What is 2+2? Reply in one sentence.")
# 28 prompt tokens → 33 completion tokens
# Total time: 929ms (end-to-end including prompt processing)
# ~35.5 completion tokens/sec
Note: The full agent prompt is ~8000 tokens (workspace files + system prompt), but most are cached from the previous turn — only new tokens need processing.
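The per-request throughput figures above can be pulled out of the server log mechanically. A parser sketch that matches the timing-line format shown in this guide (real llama-server logs may differ slightly in spacing, so treat the regex as an assumption):

```python
import re

# Matches timing lines like:
#   prompt eval time = 664ms / 27 tokens (24.6 ms/token, 40.6 tok/sec)
TIMING = re.compile(r"(prompt eval|eval) time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+) tokens")

def tokens_per_sec(line: str):
    """Return (phase, tok/s) for a timing line, or None if it doesn't match."""
    m = TIMING.search(line)
    if not m:
        return None
    phase, ms, tokens = m.group(1), float(m.group(2)), int(m.group(3))
    return phase, round(tokens / (ms / 1000.0), 1)

print(tokens_per_sec("prompt eval time = 664ms / 27 tokens"))
print(tokens_per_sec("eval time = 1998ms / 8 tokens"))
```

Piping the llama-server log through this while chatting makes regressions (e.g. from swapping) easy to spot.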
Why Nemotron Is Faster
Nemotron runs through NVIDIA NIM, which is optimized for the Blackwell architecture with TensorRT-LLM, quantization, and batching optimizations. llama.cpp is a general-purpose inference engine without Blackwell-specific optimizations. Despite being a much larger model (120B vs 31B parameters), Nemotron generates tokens ~10x faster thanks to NIM’s optimization stack.
When to Use Each Model
Use Nemotron when:
- Speed matters — ~10x faster generation
- Tool calling is needed — full native support
- Complex multi-turn conversations — larger context window, better agent behavior
Use Gemma 4 when:
- Experimenting with alternative models
- Testing Google’s latest open model architecture
- Privacy-sensitive tasks (no NIM dependency)
- Teaching and demonstration purposes (CSCI 599)
- Comparing model behaviors and capabilities
Part 9: After OpenClaw Updates
When upgrading OpenClaw inside the sandbox:
- The `openclaw.json` configs at `~/.nemoclaw/configs/` are on the host — they survive sandbox rebuilds
- After an upgrade, diff the new default config against your saved versions
- Update both `openclaw-nemotron.json` and `openclaw-gemma.json` if schema changes occurred
- The `openshell provider create` registration persists — no need to recreate it
- The llama.cpp build and model files are on the host — they survive everything
Files Reference
| File | Location | Purpose |
|---|---|---|
| llama-server binary | `~/llama.cpp/build/bin/llama-server` | Inference server |
| Gemma 4 model | `~/models/gemma-4-31B-it-GGUF/gemma-4-31B-it-f16.gguf` | 62GB GGUF model file |
| Nemotron config | `~/.nemoclaw/configs/openclaw-nemotron.json` | OpenClaw config for Nemotron |
| Gemma config | `~/.nemoclaw/configs/openclaw-gemma.json` | OpenClaw config for Gemma |
| Switch script | `~/.nemoclaw/configs/switch-model.sh` | One-command model switching |
| HuggingFace venv | `~/llama-cpp-venv/` | Python venv for the `hf` CLI |
Credits
- NVIDIA DGX Spark Playbooks: Run models with llama.cpp on DGX Spark | DGX Spark
- NVIDIA NemoClaw docs: https://docs.nvidia.com/nemoclaw/latest/
- OpenClaw inferrs provider docs: inferrs - OpenClaw
- llama.cpp Gemma 4 discussion: Can't disable thinking in gemma4 (26b-a4b) · ggml-org/llama.cpp · Discussion #21338 · GitHub
- Google Gemma 4 thinking docs: https://ai.google.dev/gemma/docs/capabilities/thinking
- NemoClaw issue #893 (openshell provider create): Connecting Nemoclaw to self-hosted vLLM model on same host · Issue #893 · NVIDIA/NemoClaw · GitHub
Guide by Marco Papa (@marcopapa99 on X), USC Viterbi School of Engineering