I’ve been working on sparkview, a Python TUI monitor that accounts for coherent unified memory behavior on GB10 / DGX Spark systems.
Problem:
On GB10, nvmlDeviceGetMemoryInfo can return total ≈ MemTotal (~121 GB). That value does not reflect allocatable memory. In practice, usable capacity tracks MemAvailable, which accounts for kernel reservations and page cache pressure.
sparkview detects this condition at runtime and switches to MemAvailable for memory display.
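A minimal sketch of that detection logic, assuming the standard /proc/meminfo format (function names here are illustrative, not sparkview's actual API):

```python
def read_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        if ":" in line:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.split()[0])  # values are reported in kB
    return info

def display_capacity_gib(meminfo: dict, nvml_total_kib: int) -> float:
    """Prefer MemAvailable when NVML's total just mirrors MemTotal (UMA case)."""
    # On coherent UMA, NVML total ~= MemTotal, which overstates allocatable memory;
    # MemAvailable accounts for kernel reservations and reclaimable page cache.
    if abs(nvml_total_kib - meminfo["MemTotal"]) < meminfo["MemTotal"] * 0.05:
        return meminfo["MemAvailable"] / (1024 ** 2)
    return nvml_total_kib / (1024 ** 2)
```

On a discrete-GPU system the NVML total differs from MemTotal, so the same code falls through to the NVML value unchanged.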
Signals:
- Memory display adjusted for UMA behavior (MemAvailable vs MemTotal)
- PSI memory pressure (/proc/pressure/memory) — LOW / MOD / HIGH / CRITICAL
- Load-gated clock states — IDLE / PASS / LOCKED / THROTTLED
- ConnectX-7 throughput via sysfs — surfaces degraded links
- Process list sorted by GPU memory
- Automatic runtime detection — no configuration
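For the PSI signal: /proc/pressure/memory exposes `some`/`full` lines with avg10/avg60/avg300 percentages. A parsing sketch; the LOW/MOD/HIGH/CRITICAL cutoffs below are illustrative, not sparkview's actual thresholds:

```python
# Illustrative thresholds on avg10 (percent of recent wall time stalled);
# sparkview's real cutoffs may differ.
LEVELS = [(50.0, "CRITICAL"), (20.0, "HIGH"), (5.0, "MOD"), (0.0, "LOW")]

def parse_psi(text: str) -> dict:
    """Parse /proc/pressure/memory content into {"some": {...}, "full": {...}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)  # "some" or "full", then key=value pairs
        fields = dict(kv.split("=") for kv in rest.split())
        out[kind] = {k: float(v) for k, v in fields.items()}
    return out

def classify(avg10_some: float) -> str:
    """Map the 10-second 'some' average to a display level."""
    for threshold, label in LEVELS:
        if avg10_some >= threshold:
            return label
    return "LOW"
```

In live use you would feed it `Path("/proc/pressure/memory").read_text()` on each refresh tick.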
Repo: https://github.com/parallelArchitect/sparkview
Requires validation on GB10 / DGX Spark hardware. If you run it on Spark, feedback is useful.
Can you include a screenshot to showcase how it looks? Thanks
Not bad, I will give it a run.
Install (fixed)
git clone https://github.com/parallelArchitect/sparkview.git
cd sparkview
# create a virtual environment (recommended on DGX Spark)
python3 -m venv sparkview-venv
# activate it
source sparkview-venv/bin/activate
# install dependencies
pip install nvitop psutil rich textual
Run
~/sparkview.sh
#!/bin/bash
cd ~/sparkview
source sparkview-venv/bin/activate
python3 main.py
deactivate
Nice tool.
How I run it, if you have uv installed:
uvx nvitop
@whpthomas appreciated — README updated with venv steps based on your feedback.
Quick note: sparkview runs as python3 main.py, not nvitop directly. The venv provides the dependencies, but the entry point is the script:
source ~/sparkview/sparkview-venv/bin/activate
python3 ~/sparkview/main.py
For persistent one-command launch, add the alias once:
echo "alias sparkview='source ~/sparkview/sparkview-venv/bin/activate && python3 ~/sparkview/main.py'" >> ~/.bashrc
source ~/.bashrc
Then just type sparkview from terminal.
Terminal height scaling — pushed in latest commit.
Thanks to @elsaco for the GB10 screenshot.
@whpthomas @bernardlbmi3 one thing to flag — the nvitop screenshots appear to be using MemTotal as the denominator on GB10 rather than allocatable memory.
On coherent UMA systems, MemTotal does not reflect usable capacity. MemAvailable is a closer approximation since it accounts for kernel reservations and page cache.
That would explain the ~92.7% utilization shown — it may be overstating actual pressure rather than indicating a real memory constraint.
There’s an open PR addressing this behavior: Fix incorrect memory reporting on coherent UMA platforms (GB10 / DGX … by parallelArchitect · Pull Request #208 · XuehaiPan/nvitop · GitHub
Thank you - that makes a lot more sense now. I've been a low-level C systems programmer since the '80s who migrated to Rust more recently. Python ¯\_(ツ)_/¯
A couple of wish list items that would help this be an ‘at a glance’ utility.
Color – use green / orange / red to highlight high loads intelligently. Obviously high memory use isn't a problem unless we're swapping a lot; then it is.
Temp – have a single graph that shows the highest temp out of GPU or CPU (with the source labeled in text), max at 100°C, using colors to warn.
If you added these, this would be really useful.
The feedback here has directly shaped the tool — venv setup, terminal scaling, and memory denominator clarification all came from this thread. Real-world Spark feedback is invaluable when building against hardware that isn’t directly accessible.
@whpthomas the Python is just the display layer — the underlying signals come from NVML and kernel-exposed UMA state. Open to a C/Rust perspective on the collection path if you want to dig in.
From your screenshot — CLOCK IDLE, P0 state, GPU 0%, CPU 0.0% — the tool correctly did not flag THROTTLED. VLLM::EngineCore had 91.9 GiB resident in unified memory, but the system was idle. The load gate prevented a false alarm — memory was occupied, not under pressure.
Both wishlist items are in v0.2.0, just pushed:
- TEMP row — current and session peak for GPU and CPU, color-coded green / yellow / red at 60°C and 80°C thresholds
- Smart color — ⚡UMA turns red when PSI HIGH, clock THROTTLED, or temp > 80°C
- Anomaly auto-logger — when an issue is detected, sparkview writes a timestamped log to ~/sparkview_logs/ with a summary.json (trigger reason, peak temps, driver, CUDA, kernel). Logs are compressed on exit. Leaves a trace for post-incident inspection — no user action required.
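A hedged sketch of what that auto-logger flow could look like; the directory layout follows the post, but the function names and summary fields beyond those listed above are assumptions, not sparkview's actual implementation:

```python
import gzip
import json
import time
from pathlib import Path

def write_anomaly_log(trigger: str, snapshot: dict, log_root: str = "~/sparkview_logs") -> Path:
    """Write a timestamped summary.json capturing the trigger and system state."""
    run_dir = Path(log_root).expanduser() / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    summary = {"trigger": trigger, **snapshot}  # e.g. peak temps, driver, CUDA, kernel
    path = run_dir / "summary.json"
    path.write_text(json.dumps(summary, indent=2))
    return path

def compress_on_exit(path: Path) -> Path:
    """Gzip the summary and drop the original (run at shutdown)."""
    gz = path.with_suffix(path.suffix + ".gz")
    gz.write_bytes(gzip.compress(path.read_bytes()))
    path.unlink()
    return gz
```

The key design point is that the log is written at detection time, not at exit, so a hard hang still leaves the uncompressed summary.json behind.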
Thanks for all the support and the real GB10 data — it directly improved the tool.
So how do I interpret this? I'm thinking maybe I need to lower my vLLM memory use?
GPU ░░░░░░░░░░░░░░░░░░░░ 1% 44°C 12.6W Mem 114.0Gi/121.7Gi ⚡UMA
MEM ██████████████████░░ 93.7% Used 114.0Gi / 121.7Gi
SWAP ░░░░░░░░░░░░░░░░░░░░ 3.6% Used 1.2Gi / 32.0Gi
CPU ░░░░░░░░░░░░░░░░░░░░ 1.5% Max 10% Active 3/20
CLOCK ░░░░░░░░░░░░░░░░░░░░ IDLE 2411MHz / 3003MHz P0
UMA ████████████████████ CRITICAL some 0.81 full 0.81
TEMP █████████░░░░░░░░░░░ GPU 44°C↑44°C CPU 46°C↑46°C
INFO 05:08:01 PM | Driver 580.142 | CUDA 13.0 | Kernel 6.17.0-1014-nvidia | Up 4d 22h
────────────────────────────────────────────────────────────
PROC PID USER GPU-MEM CPU% CMD
3177363 root 100.4Gi 0% VLLM::EngineCore
4393 whpthomas 0.1Gi 0% gnome-shell
4242 whpthomas 0.1Gi 0% Xorg
2756690 whpthomas 0.0Gi 0% gnome-system-monitor
Update
I reduced gpu_memory_utilization from 0.82 to 0.8 and that seems to have sorted it out. Very handy! - nothing else was picking that up
Glad it was useful — this is exactly the signal the tool is designed to surface.
No active GPU compute was occurring. The system was stalled in memory management — page reclaim and swap activity competing against ~114 GiB of resident model data. PSI shows ~81% of wall time spent in stalled states.
This aligns with the pre-failure condition described here:
https://forums.developer.nvidia.com/t/dgx-spark-becomes-unresponsive-zombie-instead-of-throwing-cuda-oom/353752
Those reports do not include pre-crash telemetry; PSI provides visibility into the transition.
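For reference, the PSI `total` field is cumulative stall time in microseconds, so the stalled fraction of wall time over any window is the delta between two samples divided by the window length. A sketch (sampling loop omitted; only the arithmetic is shown):

```python
def stall_fraction(total_us_start: int, total_us_end: int, interval_s: float) -> float:
    """Fraction of wall-clock time spent stalled between two PSI samples.

    `total` in /proc/pressure/memory counts cumulative stall time in
    microseconds, so fraction = delta(total) / window length.
    """
    delta_us = total_us_end - total_us_start
    return delta_us / (interval_s * 1_000_000)
```

For example, a `total` counter that advances by 810,000 µs over a 1 s window means the system spent 81% of that second stalled on memory.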
v0.2.1
- Adds IO PSI alongside memory PSI
On GB10-class systems running VLLM-style workloads, IO pressure is a second important signal — model shard loads and checkpoint writes directly compete with the unified memory pool and can accelerate stall conditions.
I extended this a little with the extra data available from here: https://github.com/antheas/spark_hwmon
Looks like I’m actively experiencing the PD-Bug…
@twaggs88 nice work — that’s solid capture of the failure state.
611 MHz at P0 with ~96% GPU load, PROCHOT ACTIVE, and PL_LEVEL 1 indicates the GPU is operating under a constrained power condition. In this state, clocks are reduced to stay within the available power budget rather than scaling with load.
DC input at ~38.8 W under sustained GPU load is notably low. For context, GB10 is a 140 W-class part, and NVIDIA specifies a 240 W supply for normal operation:
https://docs.nvidia.com/dgx/dgx-spark/hardware.html
This suggests the system is power-limited, though it does not isolate whether the constraint is due to supply, cable, or negotiation.
Worth confirming the original PSU and cable are in use. If the supply is correct, a full power cycle may help — disconnect the brick from both the wall and the system, wait ~60 seconds, then reconnect to force a fresh PD negotiation.
The anomaly logger should have captured the full timeline under ~/sparkview_logs/.
If possible, share the summary.json from that run — it will show the trigger condition, clock behavior, power draw, and throttle state leading into the event.
Also interested in the spark_hwmon extension — exposing rail-level power data adds useful visibility for this class of issue.
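On the hwmon side, rail power typically shows up as power*_input files reporting microwatts. A hedged sketch of scanning them; sensor names and availability vary by platform, and the GB10 rail layout is not assumed here (spark_hwmon documents the actual mapping):

```python
from pathlib import Path

def microwatts_to_watts(raw: str) -> float:
    """hwmon power*_input values are reported in microwatts."""
    return int(raw.strip()) / 1_000_000

def hwmon_power_watts(root: str = "/sys/class/hwmon") -> dict:
    """Scan hwmon devices for power*_input sensors and return watts by name."""
    readings = {}
    for dev in Path(root).glob("hwmon*"):
        name_file = dev / "name"
        name = name_file.read_text().strip() if name_file.exists() else dev.name
        for f in dev.glob("power*_input"):
            readings[f"{name}/{f.stem}"] = microwatts_to_watts(f.read_text())
    return readings
```

A DC-input sensor reading of 38800000 in that scheme would surface as 38.8 W, matching the figure discussed above.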