A TOP monitor program specific for the DGX SPARK

Hey all, I have put together a TOP-like program that monitors your disk and network I/O performance, as well as real-time CPU and GPU load and temperature, in a nice TUI that works over SSH.

Check it out at GitHub - GigCoder-ai/dgxtop: An Enhanced TOP program to monitor your Nvidia DGX SPARK's Hardware

This was written in Python, with GLM-4.5 Air 235b running in a Ray Cluster across 2 sparks.

Please provide feedback, and let me know what you think!

Max

Wow! This is super nice!!!

Can it display the status of two sparks on one top instance? That’s probably a hard edge case since it has to communicate via ethernet…

Nice work @maxvamp !

May I suggest getting the information from procfs and sysfs instead of relying on other apps, or using the nvidia-nvml-dev Python wheel? See Package Index
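To make the procfs suggestion concrete, here is a minimal sketch of computing overall CPU load from two reads of /proc/stat. The function names are made up for illustration and are not taken from dgxtop.

```python
import time


def parse_cpu_times(stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal guest guest_nice
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            # Guest time is already counted in user/nice, so sum only the first 8.
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percent(prev, curr):
    """Percent of time spent busy between two (busy, total) samples."""
    busy_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    return 100.0 * busy_delta / total_delta if total_delta > 0 else 0.0


def sample_cpu_percent(interval=1.0, path="/proc/stat"):
    """Read the file twice, `interval` seconds apart, and return the CPU load."""
    with open(path) as f:
        prev = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        curr = parse_cpu_times(f.read())
    return cpu_percent(prev, curr)
```

Calling sample_cpu_percent() on a Linux box should give a number comparable to what top shows, without spawning any external process. Disk and network counters can be read the same way from /proc/diskstats and /proc/net/dev.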

Take the GPU stats, for example: with the default update interval, nvidia-smi is run every second:

cmd = [
    "nvidia-smi",
    f"--query-gpu={self.QUERY_FIELDS}",
    "--format=csv,noheader,nounits",
]

Details at dgxtop/dgxtop/gpu_monitor.py at b443dc63d2beeda075ef8b47325673665d06958c · GigCoder-ai/dgxtop · GitHub
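For comparison, a rough sketch of polling through NVML bindings instead of launching nvidia-smi on every refresh. It assumes a module exposing the standard NVML Python calls (for example pynvml from the nvidia-ml-py package) is passed in. The sample_gpu and monitor names are hypothetical. I have not verified which NVML queries the Spark's GB10 supports, so the sketch sticks to utilization and temperature.

```python
import time


def sample_gpu(nvml, index=0):
    """Read utilization and temperature for one GPU through an NVML binding."""
    handle = nvml.nvmlDeviceGetHandleByIndex(index)
    util = nvml.nvmlDeviceGetUtilizationRates(handle)
    temp = nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU)
    return {"gpu_util_pct": util.gpu, "temp_c": temp}


def monitor(nvml, interval=1.0, samples=3):
    """Yield `samples` readings, initialising NVML once instead of per reading."""
    nvml.nvmlInit()
    try:
        for _ in range(samples):
            yield sample_gpu(nvml)
            time.sleep(interval)
    finally:
        # Always release the NVML context, even if a query raises.
        nvml.nvmlShutdown()
```

With the bindings installed, usage would look like "import pynvml" followed by "for s in monitor(pynvml): print(s)". Passing the module in as an argument also makes it easy to swap in a stub on machines without a GPU.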

It would be cool if it showed processes and their GPU usage, like nvidia-smi does.

(edit: I did a fork and quick implementation of this idea GitHub - sonusflow/dgxtop: An Enhanced TOP program to monitor your Nvidia DGX SPARK's Hardware )

This is amazing :)

That's cool.

Quick question: does anyone know how to get the System Memory and GPU Utilisation stats from the DGX Dashboard into the system tray or the bar at the top of the screen?

So far I am using

But would be nice to have a graph instead of just a plain number too.

I also developed one written in Rust, with easier installation and more detailed information. Everyone is welcome to give it a try.

Regular computers can use it too.

I saw this GitHub - antheas/spark_hwmon: Linux hwmon driver for the NVIDIA DGX Spark (GB10 SoC) that exposes full system power telemetry via standard sensors / sysfs interfaces. · GitHub the other day.
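Since that driver exposes its readings through the standard hwmon sysfs layout, a monitor could in principle pick them up with plain file reads. Here is a small sketch that relies only on the generic kernel convention (powerN_input files in microwatts, with an optional powerN_label). The chip and sensor names that spark_hwmon actually registers are not shown in this thread, so none are hard-coded, and read_hwmon_power is a made-up name.

```python
from pathlib import Path


def read_hwmon_power(root="/sys/class/hwmon"):
    """Return {chip_name: {label: watts}} for every powerN_input file found."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("power*_input")):
            # Use the matching powerN_label if the driver provides one.
            label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
            label = label_file.read_text().strip() if label_file.exists() else sensor.name
            try:
                microwatts = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read right now
            readings.setdefault(name, {})[label] = microwatts / 1_000_000
    return readings
```

On a machine without any power sensors this simply returns an empty dict, so it is safe to call unconditionally.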

I am using Vitals now

I use the sparkrun cluster monitor. It monitors the whole cluster, I can check it from anywhere, it is part of sparkrun, which belongs to our community, and it is well maintained by Drew @dbsci:

Kind of a no-brainer