I’ve been working on sparkview, a Python TUI monitor that accounts for coherent unified memory behavior on GB10 / DGX Spark systems.
Problem:
On GB10, nvmlDeviceGetMemoryInfo can return total ≈ MemTotal (~121 GB). That value does not reflect allocatable memory. In practice, usable capacity tracks MemAvailable, which accounts for kernel reservations and page cache pressure.
sparkview detects this condition at runtime and switches to MemAvailable for memory display.
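A minimal sketch of that detection logic, assuming the standard /proc/meminfo format (function names here are illustrative, not sparkview's actual API):

```python
def read_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.split()[0])  # values are reported in kB
    return info

def display_capacity_gib(meminfo: dict, nvml_total_kib: int) -> float:
    """Prefer MemAvailable when NVML's total just mirrors MemTotal (UMA case)."""
    # On coherent UMA, NVML total ~= MemTotal, which overstates allocatable memory;
    # MemAvailable accounts for kernel reservations and reclaimable page cache.
    if abs(nvml_total_kib - meminfo["MemTotal"]) < meminfo["MemTotal"] * 0.05:
        return meminfo["MemAvailable"] / (1024 ** 2)
    return nvml_total_kib / (1024 ** 2)
```

On a discrete-GPU system the NVML total differs from MemTotal, so the same code falls through to the NVML value unchanged.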
Signals:
- Memory display adjusted for UMA behavior (MemAvailable vs MemTotal)
- PSI memory pressure (/proc/pressure/memory) — LOW / MOD / HIGH / CRITICAL
- Load-gated clock states — IDLE / PASS / LOCKED / THROTTLED
- ConnectX-7 throughput via sysfs — surfaces degraded links
- Process list sorted by GPU memory
- Automatic runtime detection — no configuration
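For the PSI signal: /proc/pressure/memory exposes `some`/`full` lines with avg10/avg60/avg300 percentages. A parsing sketch; the LOW/MOD/HIGH/CRITICAL cutoffs below are illustrative, not sparkview's actual thresholds:

```python
# Illustrative thresholds on avg10 (percent of recent wall time stalled);
# sparkview's real cutoffs may differ.
LEVELS = [(50.0, "CRITICAL"), (20.0, "HIGH"), (5.0, "MOD"), (0.0, "LOW")]

def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/memory content into {"some": {...}, "full": {...}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)  # "some" or "full", then key=value pairs
        fields = dict(kv.split("=") for kv in rest.split())
        out[kind] = {k: float(v) for k, v in fields.items()}
    return out

def classify(avg10_some: float) -> str:
    """Map the 10-second 'some' average to a display level."""
    for threshold, label in LEVELS:
        if avg10_some >= threshold:
            return label
    return "LOW"
```

In live use you would feed it `Path("/proc/pressure/memory").read_text()` on each refresh tick.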
Repo: https://github.com/parallelArchitect/sparkview
Requires validation on GB10 / DGX Spark hardware. If you run it on Spark, feedback is useful.
Can you include a screenshot to showcase how it looks? Thanks
Not bad, I will give it a run.
Install (fixed)
git clone https://github.com/parallelArchitect/sparkview.git
cd sparkview
# create a virtual environment (recommended on DGX Spark)
python3 -m venv sparkview-venv
# activate it
source sparkview-venv/bin/activate
# install dependencies
pip install nvitop psutil rich textual
Run
~/sparkview.sh
#!/bin/bash
cd ~/sparkview
source sparkview-venv/bin/activate
python3 main.py
deactivate
Nice tool.
How I run it, if you have uv installed:
uvx nvitop
@whpthomas appreciated — README updated with venv steps based on your feedback.
Quick note: sparkview runs as python3 main.py, not nvitop directly. The venv provides the dependencies, but the entry point is the script:
source ~/sparkview/sparkview-venv/bin/activate
python3 ~/sparkview/main.py
For persistent one-command launch, add the alias once:
echo "alias sparkview='source ~/sparkview/sparkview-venv/bin/activate && python3 ~/sparkview/main.py'" >> ~/.bashrc
source ~/.bashrc
Then just type sparkview from terminal.
Terminal height scaling — pushed in latest commit.
Thanks to @elsaco for the GB10 screenshot.
@whpthomas @bernardlbmi3 one thing to flag — the nvitop screenshots appear to be using MemTotal as the denominator on GB10 rather than allocatable memory.
On coherent UMA systems, MemTotal does not reflect usable capacity. MemAvailable is a closer approximation since it accounts for kernel reservations and page cache.
That would explain the ~92.7% utilization shown — it may be overstating actual pressure rather than indicating a real memory constraint.
There’s an open PR addressing this behavior: Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX … by parallelArchitect · Pull Request #208 · XuehaiPan/nvitop · GitHub
Thank you - that makes a lot more sense now. I've been a low-level C systems programmer since the '80s who migrated to Rust more recently. Python ¯\_(ツ)_/¯
A couple of wish list items that would help this be an ‘at a glance’ utility.
Color – use green / orange / red to highlight high loads intelligently. Obviously high memory use isn't a problem unless we're swapping a lot; then it is.
Temp – have a single graph that shows the highest temp out of GPU or CPU (with the source labeled in text), max at 100°C, using colors to warn.
If you added these, this would be really useful.
The feedback here has directly shaped the tool — venv setup, terminal scaling, and memory denominator clarification all came from this thread. Real-world Spark feedback is invaluable when building against hardware that isn’t directly accessible.
@whpthomas the Python is just the display layer — the underlying signals come from NVML and kernel-exposed UMA state. Open to a C/Rust perspective on the collection path if you want to dig in.
From your screenshot — CLOCK IDLE, P0 state, GPU 0%, CPU 0.0% — the tool correctly did not flag THROTTLED. VLLM::EngineCore had 91.9 GiB resident in unified memory, but the system was idle. The load gate prevented a false alarm — memory was occupied, not under pressure.
Both wishlist items are in v0.2.0, just pushed:
- TEMP row — current and session peak for GPU and CPU, color-coded green / yellow / red at 60°C and 80°C thresholds
- Smart color — ⚡UMA turns red when PSI HIGH, clock THROTTLED, or temp > 80°C
- Anomaly auto-logger — when an issue is detected, sparkview writes a timestamped log to ~/sparkview_logs/ with a summary.json (trigger reason, peak temps, driver, CUDA, kernel). Logs are compressed on exit. Leaves a trace for post-incident inspection — no user action required.
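A hedged sketch of what that auto-logger flow could look like; the directory layout follows the post, but the function names and summary fields beyond those listed above are assumptions, not sparkview's actual implementation:

```python
import gzip
import json
import time
from pathlib import Path

def write_anomaly_log(trigger: str, snapshot: dict, log_root: str = "~/sparkview_logs") -> Path:
    """Write a timestamped summary.json capturing the trigger and system state."""
    run_dir = Path(log_root).expanduser() / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    summary = {"trigger": trigger, **snapshot}  # e.g. peak temps, driver, CUDA, kernel
    path = run_dir / "summary.json"
    path.write_text(json.dumps(summary, indent=2))
    return path

def compress_on_exit(path: Path) -> Path:
    """Gzip the summary and drop the original (run at shutdown)."""
    gz = path.with_suffix(path.suffix + ".gz")
    gz.write_bytes(gzip.compress(path.read_bytes()))
    path.unlink()
    return gz
```

The key design point is that the log is written at detection time, not at exit, so a hard hang still leaves the uncompressed summary.json behind.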
Thanks for all the support and the real GB10 data — it directly improved the tool.
So how do I interpret this? I'm thinking maybe I need to lower my vLLM memory use?
GPU ░░░░░░░░░░░░░░░░░░░░ 1% 44°C 12.6W Mem 114.0Gi/121.7Gi ⚡UMA
MEM ██████████████████░░ 93.7% Used 114.0Gi / 121.7Gi
SWAP ░░░░░░░░░░░░░░░░░░░░ 3.6% Used 1.2Gi / 32.0Gi
CPU ░░░░░░░░░░░░░░░░░░░░ 1.5% Max 10% Active 3/20
CLOCK ░░░░░░░░░░░░░░░░░░░░ IDLE 2411MHz / 3003MHz P0
UMA ████████████████████ CRITICAL some 0.81 full 0.81
TEMP █████████░░░░░░░░░░░ GPU 44°C↑44°C CPU 46°C↑46°C
INFO 05:08:01 PM | Driver 580.142 | CUDA 13.0 | Kernel 6.17.0-1014-nvidia | Up 4d 22h
────────────────────────────────────────────────────────────
PROC PID USER GPU-MEM CPU% CMD
3177363 root 100.4Gi 0% VLLM::EngineCore
4393 whpthomas 0.1Gi 0% gnome-shell
4242 whpthomas 0.1Gi 0% Xorg
2756690 whpthomas 0.0Gi 0% gnome-system-monitor
Update
I reduced gpu_memory_utilization from 0.82 to 0.8 and that seems to have sorted it out. Very handy! - nothing else was picking that up
Glad it was useful — this is exactly the signal the tool is designed to surface.
No active GPU compute was occurring. The system was stalled in memory management — page reclaim and swap activity competing against ~114 GiB of resident model data. PSI shows ~81% of wall time spent in stalled states.
This aligns with the pre-failure condition described here:
https://forums.developer.nvidia.com/t/dgx-spark-becomes-unresponsive-zombie-instead-of-throwing-cuda-oom/353752
Those reports do not include pre-crash telemetry; PSI provides visibility into the transition.
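For reference, the PSI `total` field is cumulative stall time in microseconds, so the stalled fraction of wall time over any window is the delta between two samples divided by the window length. A sketch (sampling loop omitted; only the arithmetic is shown):

```python
def stall_fraction(total_us_start: int, total_us_end: int, interval_s: float) -> float:
    """Fraction of wall-clock time spent stalled between two PSI samples.

    `total` in /proc/pressure/memory counts cumulative stall time in
    microseconds, so fraction = delta(total) / window length.
    """
    delta_us = total_us_end - total_us_start
    return delta_us / (interval_s * 1_000_000)
```

For example, a `total` counter that advances by 810,000 µs over a 1 s window means the system spent 81% of that second stalled on memory.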
v0.2.1
- Adds IO PSI alongside memory PSI
On GB10-class systems running VLLM-style workloads, IO pressure is a second important signal — model shard loads and checkpoint writes directly compete with the unified memory pool and can accelerate stall conditions.
I extended this a little with the extra data available from here: https://github.com/antheas/spark_hwmon
Looks like I’m actively experiencing the PD-Bug…
@twaggs88 nice work — that’s solid capture of the failure state.
611 MHz at P0 with ~96% GPU load, PROCHOT ACTIVE, and PL_LEVEL 1 indicates the GPU is operating under a constrained power condition. In this state, clocks are reduced to stay within the available power budget rather than scaling with load.
DC input at ~38.8 W under sustained GPU load is notably low. For context, GB10 is a 140 W-class part, and NVIDIA specifies a 240 W supply for normal operation:
https://docs.nvidia.com/dgx/dgx-spark/hardware.html
This suggests the system is power-limited, though it does not isolate whether the constraint is due to supply, cable, or negotiation.
Worth confirming the original PSU and cable are in use. If the supply is correct, a full power cycle may help — disconnect the brick from both the wall and the system, wait ~60 seconds, then reconnect to force a fresh PD negotiation.
The anomaly logger should have captured the full timeline under ~/sparkview_logs/.
If possible, share the summary.json from that run — it will show the trigger condition, clock behavior, power draw, and throttle state leading into the event.
Also interested in the spark_hwmon extension — exposing rail-level power data adds useful visibility for this class of issue.
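On the hwmon side, rail power typically shows up as power*_input files reporting microwatts. A hedged sketch of scanning them; sensor names and availability vary by platform, and the GB10 rail layout is not assumed here (spark_hwmon documents the actual mapping):

```python
from pathlib import Path

def microwatts_to_watts(raw: str) -> float:
    """hwmon power*_input values are reported in microwatts."""
    return int(raw.strip()) / 1_000_000

def hwmon_power_watts(root: str = "/sys/class/hwmon") -> dict:
    """Scan hwmon devices for power*_input sensors and return watts by name."""
    readings = {}
    for dev in Path(root).glob("hwmon*"):
        name_file = dev / "name"
        name = name_file.read_text().strip() if name_file.exists() else dev.name
        for f in dev.glob("power*_input"):
            readings[f"{name}/{f.stem}"] = microwatts_to_watts(f.read_text())
    return readings
```

A DC-input sensor reading of 38800000 in that scheme would surface as 38.8 W, matching the figure discussed above.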