Nemotron-3-Super 120B on GB10 — llama.cpp sm_121 build + Ollama GGUF incompatibility fix

sggin1 · March 14, 2026, 6:45am

Running Nemotron-3-Super 120B on DGX Spark GB10 (sm_121) — build recipe,
benchmarks, and a non-obvious GGUF compatibility fix. Sharing in case
useful while native sm_121 support matures.

Key findings:

- llama.cpp built natively for sm_121 (commit 463b6a963, CUDA 13.0, Driver 580.126.09) runs at ~14.4 t/s, effectively identical to Ollama's 14.2 t/s at this scale (memory-bandwidth bound)
- Ollama's MoE GGUF blobs are NOT compatible with upstream llama.cpp: `blk.1.ffn_down_exps.weight has wrong shape; expected 4096 got 1024`
- Ollama packs expert weights differently, so you must download the ggml-org GGUF separately
- ggml-org Q4_K (66GB) saves ~20GB vs Ollama's Q4_K_M (86GB)
- OOM pitfall on load: drop page cache first (`sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`)

Full package:

https://gist.github.com/Sggin1/cd21ed471c861e814a85925ee04dfed6

BENCHMARK_RESULTS.md

# Benchmark Results: Nemotron-3-Super 120B on DGX Spark (GB10)

**Date:** 2026-03-14
**Hardware:** DGX Spark — GB10 Grace Blackwell, 128 GB unified memory, 20-core ARM Grace
**OS:** Ubuntu 24.04, kernel 6.17.0-1008-nvidia, CUDA 13.0, Driver 580.126.09

## Head-to-Head: Ollama vs llama.cpp Native Build

| | Ollama | llama.cpp (sm_121) |
|---|---|---|


BUILD_RECIPE.md

# Build Recipe: llama.cpp with sm_121 on DGX Spark (GB10)

## Environment

| Component | Value |
|-----------|-------|
| Hardware | NVIDIA DGX Spark — GB10 Grace Blackwell |
| GPU | NVIDIA GB10, 128 GB unified VRAM, compute capability 12.1 |
| CPU | 20-core ARM (Grace) |
| OS | Ubuntu 24.04 |


COMMUNITY_POST.md

# DGX Spark: Running Nemotron-3-Super 120B with llama.cpp (sm_121 native build)

**TL;DR:** Built llama.cpp from source targeting `sm_121` (Blackwell) on a DGX Spark GB10. Generation speed is **~14.4 t/s**, comparable to Ollama's **~14.2 t/s**. The sm_121 build is not meaningfully faster, but using the ggml-org Q4_K GGUF saves ~20 GB of memory compared to Ollama's Q4_K_M, which matters on shared unified memory. Sharing the full build recipe, benchmarks, and OOM workarounds.

---

## Setup

- **Hardware:** NVIDIA DGX Spark — GB10 Grace Blackwell, 128 GB unified VRAM, 20-core ARM
- **OS:** Ubuntu 24.04, kernel 6.17.0-1008-nvidia


Files:

BUILD_RECIPE.md — exact build steps, environment, errors encountered
BENCHMARK_RESULTS.md — honest analysis, no inflated claims
run_server.sh — drop-in LAN server script
nemotron_super.modelfile — Ollama config with reasoning params
COMMUNITY_POST.md — full writeup

Hardware: DGX Spark GB10, 128GB unified memory, Ubuntu 24.04, CUDA 13.0

aniculescu · March 16, 2026, 9:04pm

Thank you for the fix, I will move this to GB10 projects

Kidtronic · March 21, 2026, 4:06pm

How’s the quality of Q4_K? Any chance of getting Q5_K_M to run?

sggin1 · March 22, 2026, 5:14am

Thank you, Kidtronic, for the question. It inspired me to run some tests (I have been busy with other projects), and here is what I found.

This is a follow-up to the [earlier post] about running Nemotron-3-Super 120B with a native sm_121 llama.cpp build. Several people asked about the official NVFP4 checkpoint: can you run it on Spark? How does quality compare? We spent a full session trying to find out.

**Short answer:** The GGUF is the only path that works on Spark today. The NVFP4 checkpoint cannot load — blocked at the kernel level, not just config.

---

## The Practical Reality: GGUF vs NVFP4 on DGX Spark

The GGUF path requires building llama.cpp from source (see [original post](COMMUNITY_POST.md) for the full recipe), but once you have the binary it’s straightforward:

1. Build llama.cpp targeting sm_121 (~3 min compile, one-time)

2. Download one file: the [ggml-org Q4_K GGUF](https://huggingface.co/ggml-org/nemotron-3-super-120b-GGUF) (66 GB)

3. Run `llama-server -m model.gguf`

The NVFP4 path requires a compatible vLLM build with CUDA 13 support, the right Docker image, correct quantization config handling, and NemotronH-specific MoE kernels. As of March 2026, none of these pieces fully exist for sm_121. We hit three separate layers of failure trying to make it work — details below.

## What Happened When We Tried NVFP4

### Layer 1: Config validation (fixable)

The community `avarok/vllm-dgx-spark` Docker image (vLLM 0.14, built Jan 2026) doesn’t recognize the NVFP4 model’s `MIXED_PRECISION` quantization config. The model was released March 11 — two months after the image was built. The model uses a mix of NVFP4 (40,961 layers) and FP8 (139 layers), which the older vLLM doesn’t expect.

```
ValueError: ModelOpt currently only supports: ['FP8', 'FP8_PER_CHANNEL_PER_TOKEN',
'FP8_PB_WO', 'NVFP4'] quantizations in vLLM.
```

We patched the ModelOpt quantization module to accept `MIXED_PRECISION` and route it to the NVFP4 handler. Config validation passed.

### Layer 2: MoE kernel incompatibility (not fixable today)

```
NotImplementedError: No NvFp4 MoE backend supports the deployment configuration.
```

All 5 available NVFP4 MoE backends (FLASHINFER_TRTLLM, FLASHINFER_CUTEDSL, FLASHINFER_CUTLASS, VLLM_CUTLASS, MARLIN) require fused `act_and_mul` MLP layers. NemotronH uses `relu2` activation with separate projections in its LatentMoE — a different architecture than standard MoE models like Mixtral or DeepSeek. We force-selected Marlin specifically and got:

```
ValueError: NvFp4 MoE backend 'MARLIN' does not support the deployment
configuration since kernel does not support no act_and_mul MLP layer.
```

This is a kernel-level limitation. No config change or patch fixes it.

### Layer 3: No pip escape hatch either

Tried installing vLLM 0.18.0 natively (the version the model README recommends). The pip wheels are compiled against CUDA 12 — Spark runs CUDA 13. Immediate import failure on `libcudart.so.12`.
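A quick way to see this kind of mismatch before installing anything is to ask the dynamic loader which CUDA runtime sonames it can find. This is a minimal sketch, not something from the original session: the helper names are hypothetical, and `libcudart.so.13` is assumed to be the soname shipped with CUDA 13. It only checks the system loader path, so a wheel that bundles its own runtime would behave differently.

```python
import ctypes


def loadable(soname: str) -> bool:
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


def cudart_report(sonames=("libcudart.so.12", "libcudart.so.13")) -> dict:
    """Map each CUDA runtime soname (assumed names) to whether it loads."""
    return {name: loadable(name) for name in sonames}


if __name__ == "__main__":
    for name, ok in cudart_report().items():
        print(f"{name}: {'found' if ok else 'not found'}")
```

On a CUDA 13 system you would expect only the `.13` entry to be found, which is consistent with the `libcudart.so.12` import failure above.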

### What would unblock NVFP4

- vLLM 0.18+ built for CUDA 13 + sm_121 with NemotronH-compatible MoE kernels

- CUDA 13.2+ (may bring native NVFP4 dispatch for Blackwell)

- An updated `avarok/vllm-dgx-spark` image with these fixes

## GGUF Quality: What We Actually Tested

**Important caveat:** This is not a rigorous benchmark. We ran 8 prompts and checked responses for basic correctness (keyword presence, correct answers). We have no BF16 or NVFP4 baseline to compare against, so we cannot make claims about how much quality the Q4_K quantization loses. We can only say whether the responses are correct and coherent.
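To make "keyword presence" concrete, a check of that kind can be as small as the sketch below. This is an illustration only, not the code from `quality_compare.py`; the function name and the example keyword are assumptions.

```python
def keyword_check(response: str, required: list[str]) -> dict:
    """Report which required keywords are missing from a response.

    Matching is case-insensitive substring search, so it only catches
    gross errors; it says nothing about reasoning quality.
    """
    text = response.lower()
    missing = [kw for kw in required if kw.lower() not in text]
    return {"passed": not missing, "missing": missing}


if __name__ == "__main__":
    # Hypothetical response to the sum-of-odds prompt.
    sample = "Using the arithmetic series formula, the sum is 2500."
    print(keyword_check(sample, ["2500"]))
```

A check like this passes a response that mentions the right number for the wrong reason, which is one reason the results below should be read as a sanity check rather than a benchmark.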

We used temperature=1.0 and top_p=0.95 — these are NVIDIA’s recommended settings for the NVFP4 model. We applied the same settings to the GGUF for consistency, though NVIDIA does not publish recommended settings for the GGUF specifically.

| Test | Tokens | Speed | Result |
|------|-------:|------:|--------|
| Code Generation (Fibonacci w/ memoization) | 882 | 16.9 t/s | Correct implementation with input validation and complexity analysis |
| Lateral Thinking (Monopoly riddle) | 189 | 16.7 t/s | Correct answer (Monopoly) with explanation |
| General Knowledge (photosynthesis vs respiration) | 1,605 | 17.2 t/s | Accurate, well-organized comparison |
| Math (sum of odds 1-100) | 337 | 17.0 t/s | Correct answer (2500) with step-by-step arithmetic sequence formula |
| Instruction Following (5 Asian countries with 'I') | 224 | 16.9 t/s | Exactly 5 items, numbered as requested |
| Creative (haiku about GPU) | 340 | 17.0 t/s | Valid 5-7-5 syllable structure |
| Technical Analysis (TCP vs UDP) | 1,669 | 17.3 t/s | Accurate comparison with use-case recommendations |
| Multi-step Reasoning ("all but 9" trick) | 290 | 16.9 t/s | Correct answer (9) with reasoning about the phrasing |

**Average: 16.99 t/s across 5,536 total tokens generated.**
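The headline figures can be rechecked from the table with a few lines of Python. The numbers are copied from the rows above; the 16.99 t/s value is the simple mean of the per-prompt speeds, not a token-weighted average.

```python
# (test name, tokens generated, speed in t/s), copied from the table above
results = [
    ("Code Generation", 882, 16.9),
    ("Lateral Thinking", 189, 16.7),
    ("General Knowledge", 1605, 17.2),
    ("Math", 337, 17.0),
    ("Instruction Following", 224, 16.9),
    ("Creative", 340, 17.0),
    ("Technical Analysis", 1669, 17.3),
    ("Multi-step Reasoning", 290, 16.9),
]

total_tokens = sum(tokens for _, tokens, _ in results)
mean_speed = sum(speed for _, _, speed in results) / len(results)

print(total_tokens, round(mean_speed, 2))  # prints: 5536 16.99
```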

All responses were factually correct for the prompts we tested. The model handled a reasoning trick question ("all but 9") and a lateral thinking puzzle (Monopoly) without stumbling. Whether this holds up on harder benchmarks (MMLU-Pro, LiveCodeBench, etc.), we don't know. NVIDIA's model card shows NVFP4 scoring about 0.4 points below BF16 on MMLU-Pro (83.33 vs 83.73); the GGUF Q4_K would likely lose more since it's post-training quantized rather than trained at lower precision, but we haven't measured this.

## How to Reproduce

### Prerequisites

- DGX Spark with llama.cpp built for sm_121 (see [original post](COMMUNITY_POST.md) for build recipe — requires cmake, CUDA toolkit 13.0)

- The ggml-org GGUF: [ggml-org/nemotron-3-super-120b-GGUF](https://huggingface.co/ggml-org/nemotron-3-super-120b-GGUF) (Q4_K, 66 GB)

### Step 1: Clear the system

The 66 GB model needs ~71 GB at runtime. You need a clean memory state.

```bash
# Stop anything holding GPU memory (for example)
sudo systemctl stop ollama
pkill -f llama-server

# Drop page cache
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

# Verify you have headroom
free -g  # check the "available" column, should show 110+ GB
```
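If you want to script the headroom check rather than eyeball `free -g`, the same number is exposed as `MemAvailable` in `/proc/meminfo`. A minimal sketch follows; the function names and the 4 GiB safety margin are assumptions, and the ~71 GB runtime figure from this post is treated as GiB for simplicity.

```python
import os


def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable (reported in kB) from /proc/meminfo text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found in meminfo text")


def has_headroom(meminfo_text: str, required_gib: float = 71.0,
                 margin_gib: float = 4.0) -> bool:
    """True if available memory covers the model plus an assumed margin."""
    return mem_available_gib(meminfo_text) >= required_gib + margin_gib


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        text = f.read()
    print(f"MemAvailable: {mem_available_gib(text):.1f} GiB, "
          f"enough for the model: {has_headroom(text)}")
```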

### Step 2: Start the server

```bash
llama-server \
  -m /path/to/Nemotron-3-Super-120B-Q4_K.gguf \
  --port 8090 \
  --host 0.0.0.0 \
  -ngl 99 \
  -fa on \
  -c 8192 \
  --metrics
```

Model load takes ~6 minutes (mmap’ing 66 GB). Wait until you see `HTTP server listening` in the output.
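Because the load takes several minutes, a script that talks to the server needs to wait for it. The sketch below polls llama-server's `/health` endpoint, which as far as I know returns HTTP 200 once the model is loaded and an error status while it is still loading. The function names, the 15-minute timeout, and the 5-second interval are assumptions, not values from the original session.

```python
import time
import urllib.error
import urllib.request


def is_ready(url: str) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


def wait_until_ready(url: str, timeout_s: float = 900.0,
                     interval_s: float = 5.0, probe=is_ready) -> bool:
    """Poll until probe(url) succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

With the server started as in Step 2, the call would be `wait_until_ready("http://localhost:8090/health")`. The `probe` parameter exists only so the loop can be tested without a running server.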

### Step 3: Test it

```bash
curl -s http://localhost:8090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the sum of all odd numbers from 1 to 100?"}],
    "max_tokens": 2048,
    "temperature": 1.0,
    "top_p": 0.95
  }' | python3 -m json.tool
```

Or use the benchmark script which runs all 8 prompts and captures full responses + speed metrics:

```bash

git clone https://gist.github.com/ nemo3_super_benchmarks

cd nemo3_super_benchmarks

python3 quality_compare.py gguf

```
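For a scripted version of the single Step 3 request, the same call can be made from Python with only the standard library. This is a minimal sketch and not the contents of `quality_compare.py`: the function names are hypothetical, and it assumes the response carries the OpenAI-style `usage.completion_tokens` field.

```python
import json
import time
import urllib.request


def build_payload(prompt: str, max_tokens: int = 2048,
                  temperature: float = 1.0, top_p: float = 0.95) -> dict:
    """Build the same request body as the curl example in Step 3."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


def tokens_per_second(response: dict, elapsed_s: float) -> float:
    """Completion tokens divided by wall-clock time for the whole request."""
    return response["usage"]["completion_tokens"] / elapsed_s


def chat(prompt: str, url: str = "http://localhost:8090/v1/chat/completions"):
    """Send one prompt and return (parsed response, elapsed seconds)."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return body, time.monotonic() - start
```

Note that the wall-clock figure includes prompt processing, so it will read slightly lower than a pure generation-speed number.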

### Memory breakdown

| Component | Size |
|-----------|-----:|
| Model weights (mmap) | 66 GB |
| KV cache (8192 context) | ~4 GB |
| CUDA context + buffers | ~1 GB |
| **Total runtime** | **~71 GB** |

The `free` command may show low “free” but check the “available” column — Linux page cache is reclaimable. You should have ~50 GB available for other processes.
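The arithmetic behind these figures, as a small sketch. The component sizes are taken from the breakdown above and the 128 GB total from the hardware spec; the function name is hypothetical. The nominal remainder comes out a little above the ~50 GB quoted, since the OS and other resident processes also take their share.

```python
def runtime_budget(components_gb: dict, total_memory_gb: int = 128) -> tuple:
    """Return (total runtime footprint, nominal memory left over) in GB."""
    used = sum(components_gb.values())
    return used, total_memory_gb - used


components = {
    "model weights (mmap)": 66,
    "KV cache (8192 context)": 4,
    "CUDA context + buffers": 1,
}

used_gb, left_gb = runtime_budget(components)
print(used_gb, left_gb)  # prints: 71 57
```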

## Speed comparison across our tests

| Date | Build | max_tokens | Prompts | Avg t/s |
|------|-------|--------:|--------:|--------:|
| March 22 | llama.cpp build 924 (01d8eaa) | 2048 | 8 | **16.99** |
| March 14 | llama.cpp 463b6a963 | 512 | 4 | 14.43 |

The March 22 test used a newer llama.cpp build, higher max_tokens, and more prompts than the March 14 test — multiple variables changed, so we can’t attribute the speed difference to any single factor. The takeaway is that this model runs at roughly 14-17 t/s on Spark depending on build and settings.

## Bottom Line

If you’re on DGX Spark and want to run Nemotron-3-Super 120B today, the GGUF is the only working path. It requires a one-time llama.cpp build, a 66 GB download, and a single command to serve. Responses were correct across the prompts we tested, and generation runs at ~17 t/s.

The NVFP4 checkpoint is blocked by missing MoE kernel support in vLLM for NemotronH’s architecture. This isn’t a config issue — the kernels need to be updated for the LatentMoE layer structure. We’ll update when this changes.

---

*Tested on DGX Spark (GB10, Ubuntu 24.04, CUDA 13.0, Driver 580.126.09), llama.cpp build 924 (01d8eaa), March 2026.*
