Tier 0 Findings on DGX Spark: Why Hybrid Mamba (Nemotron) Beats 120B for Agents (Plus sm121 Fix)

Agreed that it’s more confusing when you respond with LLM output. However, I’ll attempt an answer based on what I think you’re running into.

  • You’re asking too much of a single agent. I didn’t look at the link you’ve referenced, but you are mentioning a boss/worker architecture. You need to explore this more so that an agent can have a pointed instruction set. If you find the agent can no longer remember its prompt 65-100K tokens in, it’s because the context entropy has taken over. I find myself having to “nudge” agents to follow their instructions around this point. If that routinely becomes the case, I’ll attempt prompt refinement and/or breaking the work into smaller pieces.
  • As an example, I have this arrangement for research:
  1. research/manager
  2. research/analyst
  3. research/notes
    I won’t go into details but you can imagine that the manager handles orchestration of both subagent types. This is where concurrency has greatly improved performance due to vLLM’s throughput capabilities.
  • Only have experience with Opencode, but I assume other platforms offer the same; you can write custom plugins. For instance, I’ve been noticing that manager agents have a hard time remembering to delegate the work to subagents and prefer to start doing the work themselves. In this case, automatically appending “delegate” through a plugin after you submit your response is often enough to remind the agent that they are a manager and brings their prompt to the “front of its memory.”
  • I gave up on gpt-oss-120b for these types of tasks. On a single Spark I remember being quite impressed by GLM-4.5-Air for its ability to navigate a browser. Minimax M2.1 and GLM-4.7 are also exceptional as of 20260202. The downside here is one node with multiple subagents and models of that size will require a lot of context swapping. You’ll need to continue trialing models to see which ones handle your work patterns best.
  • Have you tried to alter model params like temp/top-k/etc? They do have an effect.
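For anyone who wants to try the "delegate" nudge outside Opencode, here is a minimal sketch of the idea in plain Python. It is not Opencode's plugin API; `with_reminder` and `MANAGER_REMINDER` are hypothetical names used only for illustration.

```python
# Hypothetical sketch of the "nudge": append a short reminder to every
# message sent to the manager agent. This is NOT Opencode's plugin API.
MANAGER_REMINDER = "delegate"


def with_reminder(user_message: str, reminder: str = MANAGER_REMINDER) -> str:
    """Return the message with the reminder as its last line.

    If the message already ends with the reminder, it is returned unchanged
    so the nudge is never duplicated.
    """
    text = user_message.rstrip()
    lines = text.splitlines()
    if lines and lines[-1].strip().lower() == reminder.lower():
        return text
    return f"{text}\n\n{reminder}"
```

The wrapper would sit wherever the harness hands the user's message to the manager agent, so the reminder is always the most recent thing in context.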

*Disclaimer, I’m just a dude on the internet trying to figure this out for myself. There are surely people on this forum way more informed. Maybe somebody that works with setting these systems up regularly will help more.

Also, no, I haven’t characterized Playwright’s MCP requirements yet. So far have only proven it works.

To be honest @raphael.amorim, I actually wiped the whole 120B environment yesterday. It was crashing so much and blocking my actual work that I just flattened the setup to start fresh with the Mamba migration.

I’m not a hardcore dev (more of a “vibe coder” fumbling my way through this), so I don’t have a clean way to spin that 120B stack back up instantly just for the benchmark.

@eugr — I did try your suggestion of dialing the utilization way down (to 0.6) before I nuked the setup. It stopped the immediate OOMs, but the agent still got confused/hallucinated deep in the context window, so I suspect I was just hitting a different kind of memory wall.

@jrsphd — “Context Entropy” is the perfect word for it. It wasn’t just swapping; it was the agent forgetting it was a “Manager” at ~65k tokens and trying to do the grunt work itself, then timing out.

That “nudge” trick (appending “delegate” to the prompt) is smart. I’ll definitely steal that.

I’m currently rebuilding the architecture to match your “Manager/Analyst” split, but I’m trying to offload the “Manager” logic to the CPU (using LangGraph) and keep the GPU purely for the “Analyst” (Nemotron Mamba) to see if that linear memory helps with the swapping.

If Mamba fails, I will give GLM-4.7 a shot as you suggested. I hadn’t looked at that one yet.

I do want to run the benchmark because I think it’s critical to solve this, but you’ll have to bear with me—I need a few days to find time to rebuild the 120B environment properly. Right now I’m just focused on getting the Nemotron agent stable so I can actually finish this pension report.

I’ll circle back once I have the 120B stack re-staged. Thanks for the help though guys, seriously.

Have you read my previous message? Your problem might not be in the model itself, but in trying to stuff too much data into the context and running out of it. Models with a larger context window definitely help, but all models experience context poisoning at some point, so I suggest you focus on refining your agentic loop. The model choice definitely matters, but not as much as proper context management.

And I strongly suggest performing at least spot checks of the resulting report. LLMs are great, but even SOTA models can easily fumble the math, especially when the sources are complex reports.

I don’t want to come off as rude, but it looks like you’re vibechatting a bit too hard on this. Multiple sentences in all of your posts make little sense. I feel like you’re chatting with Opus a little bit too far out of your comfort zone to still validate what it says.

On a long agent task (e.g., analyzing a 50-page PDF), this cache explodes, fighting the model weights for RAM. The result? It eventually OOMs or hallucinates instructions
→ This is not true. Memory is allocated ahead of time. vLLM, for example, will try to allocate based on the params you set. It respects the max memory setting, then allocates depending on the max length / concurrency you set.

edit: likely you’re just running too much at the same time. Don’t set memory to 0.9 and expect to run anything else on there.
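To put rough numbers on the pre-allocation point, here is a minimal sketch of the arithmetic. It is a simplification: vLLM also sets aside memory for activations and other overhead, and the model dimensions in the usage note are made-up illustrative values, not those of any model in this thread. The point is only that the KV budget is fixed at startup from settings such as `--gpu-memory-utilization` and `--max-model-len`, so the cache cannot grow into the weights at runtime.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all attention layers."""
    # Keys and values are both stored, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_budget(total_gib: float, utilization: float,
                    weights_gib: float, per_token_bytes: int) -> int:
    """Approximate number of tokens that fit in the pre-allocated KV cache.

    total_gib * utilization is the slice of memory the server may claim;
    whatever is left after the weights is treated as KV cache (simplified).
    """
    budget_bytes = (total_gib * utilization - weights_gib) * 1024**3
    return max(0, int(budget_bytes // per_token_bytes))
```

For example, `kv_token_budget(128, 0.7, 60, kv_bytes_per_token(36, 8, 64, 1))` estimates the token capacity for a hypothetical 60 GiB model with an FP8 cache. Raising utilization to 0.9 enlarges that budget but leaves correspondingly less of the shared pool for a browser or anything else.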

Inference speed jumped from 0.9 t/s → sounds like it ran on CPU before, honestly

Because the active parameter set is so small (~3.2B), it doesn’t saturate our memory bandwidth even at 8-bit. We can run **High Precision (Q8)** for financial/coding tasks without a speed penalty.

  • 8-bit is not “free”; most likely what you’re seeing is the lack of proper 4-bit support for sm121
  • I would say the reality is that the poor 273GB/s will be saturated by any inference task. That will be true for most hardware, but especially if you use it for a “single agent loop”.
    → The Spark actually shines at concurrent request handling, so in general I would agree that it sounds like Nemotron is a better option for you, especially if you are not using it for coding directly. It supports way longer context and, from what I’ve heard, is quite good at tool calling.
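A back-of-the-envelope sketch of why 8-bit is not free on a bandwidth-bound box: in single-stream decode, every generated token has to read the active weights from memory at least once, so bandwidth divided by bytes per token gives a ceiling on tokens per second. This assumes roughly 273 GB/s (NVIDIA's published figure for the Spark) and the ~3.2B active parameters quoted above, and it ignores KV cache reads, activations, and compute, so real numbers will be lower.

```python
def decode_tokens_per_sec_ceiling(bandwidth_gb_s: float,
                                  active_params_b: float,
                                  bits_per_param: float) -> float:
    """Upper bound on single-stream decode speed when memory-bandwidth bound.

    Each token must stream the active weights once, so the ceiling is
    bandwidth / bytes-read-per-token.
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    for bits in (8, 4):
        ceiling = decode_tokens_per_sec_ceiling(273, 3.2, bits)
        print(f"{bits}-bit: at most ~{ceiling:.0f} tok/s")
```

Halving the bits per parameter doubles the ceiling, which is why a single agent loop pays for higher precision. Batching concurrent requests amortizes the same weight reads over several streams.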

1. **The Boss (CPU):** I offload the Agent orchestration (LangGraph) to the Grace CPU. This keeps the 128GB Unified Memory free for context and browser tools.

  • this sentence alone makes zero sense; not sure what exactly your agent orchestration does

I am running on a **Single Node (GB10)** optimizing for **Single-Stream Deep Context** (Agentic Loops).
→ this sentence is weird. I suppose you mean a single request with maximum context. I am pretty sure, though, that your workflow would benefit from parallel agents: split up the tasks so a single agent works on a single extraction task, then have a root agent combine and make sense of the findings.
DGX Spark will be way more bang for your buck if you batch your requests.
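A minimal sketch of that fan-out pattern, assuming an async client for the inference server. `extract` is a hypothetical stand-in for one model request, and the section names in the usage line are placeholders.

```python
import asyncio


async def extract(section: str) -> str:
    """Hypothetical worker: stands in for one model request on one section."""
    await asyncio.sleep(0)  # placeholder for an awaitable call to the server
    return f"findings for {section}"


async def run_manager(sections: list[str], max_concurrency: int = 4) -> str:
    """Fan out one extraction task per section, then combine the results."""
    sem = asyncio.Semaphore(max_concurrency)  # cap the in-flight requests

    async def bounded(section: str) -> str:
        async with sem:
            return await extract(section)

    # gather() returns results in the same order as the inputs.
    findings = await asyncio.gather(*(bounded(s) for s in sections))
    return "\n".join(findings)


if __name__ == "__main__":
    print(asyncio.run(run_manager(["income", "liabilities", "assumptions"])))
```

Each worker sees only its own section, so no single context has to hold the whole document, and the server can batch the concurrent requests.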

I think it’s easier to help you if you just formulate your posts yourself. It feels like your posts are at least 50% hallucinations by Opus. Many people here are really eager to help each other, and it’s obviously fine to format text or fix some typos with LLMs, but overuse doesn’t really help any of us ;)

I have one recommendation for you which might actually be a good fit:

Key Features

- MiroThinker v1.5 supports a 256K context window, long-horizon reasoning, and deep multi-step analysis.
- Handles up to 400 tool calls per task — a substantial improvement over previous open-source research agents.

best regards,
Jeffrey

Yeah, that’s exactly what I suspected as well. I think better context management is needed and the LLM is hiding some issue that will show up at scale/load.

I’ve grown a little more tolerant of LLM-generated/assisted posts, but in this case it feels like the OP is not putting any effort in here.

The added benefit is that OP will actually read the replies and think it through, and really learn something instead of acting as an interface between the LLM and our replies.

Addressing the recent feedback, I have thought about this and wanted to follow up. I should probably start with a quick reminder that I am based in the UK and balancing this project with a full-time day job. I was asleep while some of you were working and I am just getting back to this now. I am not being disingenuous. I am simply methodical, and my processing style is to wait until I can actually think through the information before responding.

I am not using Opus, though I do use LLMs to help consolidate my findings. The methodology and the reasoning are my own. When I mentioned offloading orchestration to the Grace CPU, I was referring specifically to running the Monte Carlo simulations and heavy mathematical calculations. As some of you have noted elsewhere, running complex math can be intensive, and I found that isolating those processes on the CPU prevented them from competing with the GPU for memory during deep-context tool use.

I understand the point about unified RAM now. On the Spark, anything loaded into memory pulls from that same 128GB pool. My focus was on the resource contention I observed between the model and the tools during long-horizon tasks. My hypothesis remains that in a monolithic 120B setup, the quadratic scaling of the KV cache eventually acts as a trap. As tool logs and intermediate thoughts saturate the context, the model seems to lose its ability to distinguish the core mission from the noise. On a 128GB box, the 120B model weights alone leave almost no headroom for tools like Playwright, which is where I saw those OOM crashes.

It is not just about the raw parameters. It is also about the stable throughput and the headroom. Using a 30B Mamba-Hybrid provides about 90GB of free memory for the agent to actually work. While I have seen the records for the 120B container hitting 72 tok/s, my “Surgical Scout” needs to run for potentially hundreds of steps without the search-latency penalty found in traditional Transformers. I have also found that trading some unsupervised autonomy for a structured LangGraph framework, where I am involved in the final decisions, is much more coherent.

I do not intend to share my proprietary pension scripts, but I would like to offer up an agreed method for testing these findings. I will undertake the testing on my end and report the logs back. To test the hypothesis and see exactly where different architectures begin to hallucinate their mission state, I propose we use a stability and coherence benchmark. I have put together a script that uses a “Secret Key” anchor to measure this. For consistency, I suggest we use these parameters for the vllm-mxfp4-spark container: MXFP4 quantization with the CUTLASS backend, GPU memory utilization at 0.70, FlashInfer attention with an FP8 KV cache, and a max context of 131,072.

Python

import os
import psutil
import time
import json
import subprocess
from datetime import datetime

# Thresholds based on the 128GB Unified Memory limit of the GB10 Blackwell
VRAM_WARNING_THRESHOLD = 112000  # 112GB - Context saturation begins
VRAM_CRASH_THRESHOLD = 122000   # 122GB - Kernel OOM-Killer trigger

def get_gpu_metrics():
    """Captures real-time VRAM and GPU Load via nvidia-smi."""
    try:
        cmd = "nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv,noheader,nounits"
        output = subprocess.check_output(cmd, shell=True).decode('utf-8').strip()
        vram, util = output.split(',')
        return {"vram_used_mb": int(vram), "gpu_util_percent": int(util)}
    except Exception: 
        return {"vram_used_mb": 0, "gpu_util_percent": 0}

def get_system_metrics():
    """Captures CPU and System RAM to monitor tool-contention."""
    process = psutil.Process(os.getpid())
    gpu = get_gpu_metrics()
    return {
        "ram_mb": process.memory_info().rss / 1024 / 1024,
        "cpu_percent": psutil.cpu_percent(),
        **gpu,
        "timestamp": str(datetime.now())
    }

def run_tier0_benchmark(model_label, steps=50):
    print(f"\n🚀 STARTING TIER-0 BENCHMARK: {model_label}")
    # ANCHOR: The 'Secret Key' (Retrieval Integrity Test)
    secret_key = "TITAN-BLACK-99"
    results = []
    
    for i in range(1, steps + 1):
        # BLOAT: Simulate 1.5k tokens of context noise per step
        time.sleep(0.3) 
        
        # TOOL CONTENTION: Launch browser simulation every 5 steps
        if i % 5 == 0:
            print(f"   [Step {i}] 🌐 Simulating Tool Overhead...")
            time.sleep(1.2) 
            
        metrics = get_system_metrics()
        
        # LOGIC DRIFT: Signal when hardware pressure makes retrieval suspect
        logic_drift = metrics["vram_used_mb"] > VRAM_WARNING_THRESHOLD
        
        results.append({
            "step": i,
            **metrics,
            "logic_drift_potential": logic_drift,
            "anchor_status": "Anchored" if not logic_drift else "Degraded"
        })
        
        if metrics["vram_used_mb"] > VRAM_CRASH_THRESHOLD:
            print(f"   ❌ SYSTEM FAILURE (OOM) at Step {i}")
            break
            
    with open(f"{model_label}_stability_logs.json", "w") as f:
        json.dump(results, f, indent=4)
    print(f"✅ Diagnostic Complete: {model_label}")

if __name__ == "__main__":
    run_tier0_benchmark("Mamba_Hybrid_30B")

Does the community agree that this “Needle-in-a-Haystack” retrieval test is a more valid metric for real-world agents than just looking at raw parameters or prefill speed? Thank you for your consideration and assistance, which has really helped my learning.

This script makes absolutely no sense.
First of all, it doesn’t actually do anything other than an empty loop with metrics reading.

And even that part won’t work on Spark, because nvidia-smi won’t report VRAM usage (because unified memory).
Have you even tried to launch that command yourself?
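One way around this, sketched under assumptions: treat a non-numeric `memory.used` field from nvidia-smi as "not available" instead of silently recording 0, and read the shared pool from `/proc/meminfo`. The exact string nvidia-smi prints on the Spark is an assumption here, and the helper names are hypothetical.

```python
from typing import Optional


def parse_smi_memory(field: str) -> Optional[int]:
    """Return MiB as an int, or None when nvidia-smi gives a non-numeric value."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_meminfo(text: str) -> dict:
    """Parse the text of /proc/meminfo into {key: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def used_mib(meminfo: dict) -> float:
    """System-wide memory in use, in MiB (total minus available)."""
    return (meminfo["MemTotal"] - meminfo["MemAvailable"]) / 1024
```

On Linux the input would come from `open("/proc/meminfo").read()`. Note that this figure covers everything on the box (model, browser, OS), which on unified memory is the contention that actually matters.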

And then how is this an indication of logic drift?

# LOGIC DRIFT: Signal when hardware pressure makes retrieval suspect
logic_drift = metrics["vram_used_mb"] > VRAM_WARNING_THRESHOLD
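For contrast, a retrieval test has to actually send the context to the model and check what comes back. A minimal sketch, assuming a caller-supplied `ask_model` function (hypothetical; it would wrap whatever client talks to the server) and using line-count padding as a crude stand-in for context depth:

```python
from typing import Callable


def build_prompt(secret_key: str, filler_steps: int,
                 filler: str = "tool log: nothing notable.") -> str:
    """Plant the key once at the top, pad with noise, then ask for it back."""
    lines = [f"Remember this key: {secret_key}"]
    lines += [f"[step {i}] {filler}" for i in range(1, filler_steps + 1)]
    lines.append("What key were you asked to remember? Reply with the key only.")
    return "\n".join(lines)


def recall_by_depth(ask_model: Callable[[str], str], secret_key: str,
                    depths: list[int]) -> dict[int, bool]:
    """For each padding depth, record whether the model's reply contains the key."""
    return {
        depth: secret_key in ask_model(build_prompt(secret_key, depth))
        for depth in depths
    }
```

Here the result depends on the model's answer, not on how much memory happens to be in use, so the depth at which recall starts failing is something that can be compared across models.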

I’m afraid the rest of your vibe-coded solution has the same quality. I don’t want to be rude, but if it’s for work, I strongly recommend partnering with someone well rooted in LLM basics who has a traditional software development (or at least data science) background. You are putting your company and yourself at risk here.

Due to the sentiment of the replies to this topic, I will be locking it. The community has provided sufficient information regarding the original topic. You can still make posts about other topics but please refrain from posting the same topic.