NVIDIA DGX Spark Cluster Dashboard

I love my Sparks, but I was annoyed by having two SSH sessions running btop on each node to monitor my two‑node cluster. I spent some time creating a web‑based, btop‑inspired dashboard that I hope will help others.

You can see it on GitHub: paul-aviles/NVIDIA-DGX-Spark-Dashboard.

Hello,

First of all, this is something that NVIDIA itself should ideally have provided, so I think it’s fantastic that you created this software.
I recently started clustering and had been hoping for a dashboard like this, so it was extremely helpful. Thank you very much.

I’d like to share a couple of thoughts after reading the README:

  • If Step 4 of the playbook “Connect Two Sparks” (https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks) has already been completed, then Steps 1 to 3 may not be necessary. It might be helpful to mention that, in that case, users can proceed directly to editing backend/config.yaml.
  • There is no Step 4; it jumps directly to Step 5 :)

Hi there, thanks for pointing out that Step 4 was missing. I just fixed that.

As for the other point, I created this to run not on the Sparks but on another system, hence you need Steps 1, 2 and 3. You do need to have the Sparks clustered, of course.

I don’t believe NVIDIA needs to provide it all. I agree 100% that it would be great if they addressed the IB issues mentioned elsewhere in the forums, but the lack of a dashboard is not a huge issue for me.

They did not provide btop, htop, or other such tools either, so this is a complementary tool. Perhaps they have something like it on their large systems; I do not know.

Also, I know of an issue where the overall utilization bar for enP7s7 shows a full bar when the interface is not really at 100% utilization, so I will look into that as well.
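
For readers curious how a utilization bar like this is typically derived, here is a minimal sketch in Python. It is not taken from the dashboard's code, and the cause of the full-bar issue is not known here; the function name and parameters are hypothetical. It assumes utilization is computed from two byte-counter samples and the link speed in Mbit/s, and it treats a missing or non-positive link speed as "unknown" rather than drawing a full bar (on Linux, /sys/class/net/<iface>/speed can report -1 when the speed is not known).

```python
from typing import Optional


def utilization_percent(prev_bytes: int, curr_bytes: int,
                        interval_s: float,
                        link_speed_mbps: int) -> Optional[float]:
    """Return link utilization in percent (0..100), or None if unknown.

    Hypothetical helper for illustration only; not the dashboard's code.
    """
    # An unknown link speed or a bad sampling interval cannot yield a
    # meaningful percentage, so report "unknown" instead of guessing.
    if interval_s <= 0 or link_speed_mbps <= 0:
        return None

    delta_bytes = curr_bytes - prev_bytes
    # A negative delta means the counter wrapped or was reset.
    if delta_bytes < 0:
        return None

    bits_per_second = delta_bytes * 8 / interval_s
    capacity_bps = link_speed_mbps * 1_000_000

    # Clamp so bursts or timing jitter never overflow the bar.
    return max(0.0, min(100.0, 100.0 * bits_per_second / capacity_bps))


if __name__ == "__main__":
    # 12.5 MB transferred in one second on a 1000 Mbit/s link.
    print(utilization_percent(0, 12_500_000, 1.0, 1000))
    # Unknown link speed: the caller can render an empty or greyed-out bar.
    print(utilization_percent(0, 12_500_000, 1.0, -1))
```

Returning None for the unknown cases leaves it to the UI to decide how to render them, instead of dividing by a bogus capacity and pinning the bar at 100%.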

Paul

Thanks for pointing it out. Fixed.

Nice! I made a dashboard as well. I’ve got a mixed setup, with two RTX 5000s on one host and my Spark as another. I still have a lot of work to do on it, but it’s shaping up.