NVIDIA DGX Spark Cluster Dashboard

I love my Sparks, but I was annoyed by having two SSH sessions running btop on each node to monitor my two‑node cluster. I spent some time creating a web‑based, btop‑inspired dashboard that I hope will help others.

You can see it on GitHub: paul-aviles/NVIDIA-DGX-Spark-Dashboard.

Hello,

First of all, this is something that NVIDIA itself should ideally have provided, so I think it’s fantastic that you created this software.
I recently started clustering and had been hoping for a dashboard like this, so it was extremely helpful. Thank you very much.

I’d like to share a couple of thoughts after reading the README:

  • If Step 4 of the playbook “Connect Two Sparks” (https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) has already been completed, then Steps 1 to 3 may not be necessary. It might be helpful to mention that, in that case, users can proceed directly to editing backend/config.yaml.
  • There is no Step 4; it jumps directly to Step 5 :)

Hi there, thanks for pointing out that Step 4 was missing. I just fixed that.

As for the other point, I created this to run on another system rather than on the Sparks themselves, hence you need Steps 1, 2, and 3. You do need to have them clustered, of course.

I don’t believe NVIDIA needs to provide it all. It would be great if they addressed the IB issues mentioned elsewhere here in the forums (I agree 100% on that), but this is not a huge issue for me.

They did not provide btop, htop, and other tools either, so this is a complementary tool. Perhaps they have something like it on their large systems; I do not know.

Also, I know of an issue where the overall utilization bar for enP7s7 shows as full when the interface is not actually at 100% utilization, so I will look into that as well.
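For anyone curious how a bar like that can end up pegged: this is only a minimal sketch and not the dashboard's actual code. It assumes utilization is derived from byte-counter deltas divided by the link speed. On Linux, /sys/class/net/<iface>/speed reports Mbit/s and can return -1 when the speed is unknown, which would break a naive division. The function name and parameters below are hypothetical.

```python
from typing import Optional


def utilization_percent(bytes_before: int, bytes_after: int,
                        interval_s: float,
                        link_speed_mbps: Optional[int]) -> Optional[float]:
    """Return link utilization in percent, or None if it cannot be computed.

    Hypothetical helper: the inputs are two samples of an interface byte
    counter taken interval_s seconds apart, plus the link speed in Mbit/s.
    """
    # An unknown or invalid link speed (e.g. -1) means "no data", not "100%".
    if link_speed_mbps is None or link_speed_mbps <= 0 or interval_s <= 0:
        return None
    # Guard against counter resets producing a negative delta.
    delta_bytes = max(bytes_after - bytes_before, 0)
    bits_per_second = delta_bytes * 8 / interval_s
    capacity_bps = link_speed_mbps * 1_000_000
    # Clamp so short bursts or rounding never draw past a full bar.
    return min(100.0 * bits_per_second / capacity_bps, 100.0)
```

Returning None for an unknown speed lets the UI draw an empty or "n/a" bar instead of a full one.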

Paul

Thanks for pointing it out. Fixed.

Nice! I made a dashboard as well. I’ve got a mixed setup, with two RTX 5000s on one host and my Spark as another. I still have a lot of work to do on it, but it’s shaping up.

I would like to say thank you. The values for SSD storage capacity and memory are not 100% exact, but it is a very nice dashboard, which I use a lot for my Spark cluster.

Maybe you could add some additional CPU and GPU power and temperature values :-). I saw this the other day, which might be nice: antheas/spark_hwmon on GitHub, a Linux hwmon driver for the NVIDIA DGX Spark (GB10 SoC) that exposes full system power telemetry via standard sensors / sysfs interfaces.
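In case it helps, here is a rough sketch of how a dashboard backend could pick up such values. It relies only on the generic Linux hwmon sysfs convention (temp*_input in millidegrees Celsius, power*_input in microwatts). Which channels spark_hwmon actually exposes is not something I have checked, so treat the function name and the returned structure as hypothetical.

```python
from pathlib import Path


def read_hwmon(root: str = "/sys/class/hwmon") -> dict:
    """Collect temperature (deg C) and power (W) readings per hwmon device.

    Hypothetical helper: keys look like "hwmon0:<driver name>".
    """
    readings = {}
    for dev in sorted(Path(root).glob("hwmon*")):
        name_file = dev / "name"
        name = name_file.read_text().strip() if name_file.exists() else dev.name
        values = {}
        # (glob pattern, divisor to convert to the displayed unit)
        for pattern, scale in (("temp*_input", 1_000.0),
                               ("power*_input", 1_000_000.0)):
            for f in sorted(dev.glob(pattern)):
                try:
                    values[f.name] = int(f.read_text()) / scale
                except (OSError, ValueError):
                    # Some sensors fail to read or return no number; skip them.
                    continue
        if values:
            readings[f"{dev.name}:{name}"] = values
    return readings
```

Calling read_hwmon() on a node returns whatever sensors that node registers, so it would have to run on each Spark (or over SSH) rather than on the machine hosting the dashboard.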

Thank you for the feedback. We have started implementing features to help cluster your devices in NV Sync. In the latest release, you can check it out in the Settings tab. In the next Sync release, we also plan to add more of a dashboard to monitor your cluster.