Again, I’m trying to minimise the noise on a topic that’s purely supposed to be around getting nvtop running on the Spark.
So firstly, nvtop is not intended to monitor system RAM. While it has a basic process view at the bottom that will tell you the system RAM used by processes, it’s specifically designed to monitor GPU resources, i.e. total VRAM and used VRAM. On a unified memory system, your VRAM is whatever is actually available for the GPU to allocate.
It is good to know how much VRAM you actually have available to the GPU, as this is not always obvious. If a process is using a lot of system RAM, it reduces the VRAM you have available. As a simple example, consider the screenshot below:
This is a training process running in Ostris’ AI toolkit. From this, I can see I only have about 64GB of VRAM in total, which immediately tells me there is a huge amount of system RAM allocated. I wouldn’t expect the training process to use this much system RAM, which likely indicates a software bug or a misconfigured setting that needs to be fixed. Perhaps a model is getting cached in system RAM when it doesn’t need to be, or something that is currently running in system RAM on the CPU should really be running on the GPU instead.
If all you have is the total unified memory and the amount of unified memory used, you’re completely flying blind. From that, all you can tell is that the process is using about 90GB. It all looks fine from the outside, when in reality there’s clearly a problem: you would expect the system RAM usage to be closer to 3GB, but instead we’re seeing usage of over 40GB. By knowing our available VRAM is 64GB, we can tell at a glance that there’s a problem.
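To make that sanity check concrete, here is a minimal sketch in Python. The function name and the 3x threshold are my own assumptions for illustration, not anything nvtop provides; the only real numbers are the roughly 3GB I expected and the 40GB I was seeing.

```python
def host_ram_looks_wrong(observed_gb, expected_gb, factor=3.0):
    """Return True when observed system RAM use is far above what you expected.

    `factor` is an arbitrary illustrative threshold (an assumption):
    anything more than `factor` times the expected figure gets flagged.
    """
    if expected_gb <= 0:
        raise ValueError("expected_gb must be positive")
    return observed_gb > expected_gb * factor


# Figures from the example above: about 3GB expected, over 40GB observed.
print(host_ram_looks_wrong(observed_gb=40, expected_gb=3))
```

The point is only that you need the system RAM and VRAM figures separately to run a check like this at all. The total unified memory figure on its own gives you nothing to compare against.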
In short: it’s just one additional data point that’s extremely useful, since without it, you would have no visibility on an issue like the one I mentioned above.
Some tips for you based on your screenshot: learn about tmux. It means you don’t need to have all those terminal windows open, it keeps processes running even if the SSH session dies, and it lets you reconnect to that session, including all its windows. I often have nvtop and htop running at the same time, which gives me all the information I need, though I do find myself using nvtop 99% of the time as it tells me what I need to know from a GPU perspective.
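If it helps, the nvtop plus htop setup can be sketched like this. It is only a rough illustration that wraps the tmux commands in Python; the session name "monitor" and the helper names are made up for the example, and it assumes tmux, nvtop and htop are already installed and on your PATH.

```python
import subprocess


def tmux_monitor_commands(session="monitor"):
    """Build the tmux commands for a detached session with nvtop and htop.

    The session name is a hypothetical default; pick whatever you like.
    """
    return [
        # Start a detached session whose first pane runs nvtop.
        ["tmux", "new-session", "-d", "-s", session, "nvtop"],
        # Split that session side by side and run htop in the new pane.
        ["tmux", "split-window", "-h", "-t", session, "htop"],
    ]


def start_monitor_session(session="monitor"):
    """Run the commands above; raises CalledProcessError if tmux fails."""
    for argv in tmux_monitor_commands(session):
        subprocess.run(argv, check=True)
```

After calling start_monitor_session(), you can attach with "tmux attach -t monitor" and detach again with Ctrl-b then d. The session keeps running on the Spark even if your SSH connection drops.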
As for all the different metrics being confusing, it shouldn’t be. Basically, you have 128GB of RAM in total. The system seems to allocate part of that to something, perhaps related to iGPU functionality, but that’s just a guess. The rest is available to the Linux system, the software running on it, and the GPU. What’s not being used by the OS and all the other software is available as VRAM that can be allocated by the GPU. So different tools will give you different numbers depending on what exactly each one is showing you. As mentioned, the main tools I suggest using are htop and nvtop. I don’t personally use Nvidia’s web-based monitoring, as I’m not convinced it gives as accurate a view of the usage as the other tools I mentioned do.
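If you want to see the Linux side of those numbers for yourself, here is a small sketch that reads the standard MemTotal and MemAvailable fields from /proc/meminfo. The function names are made up for the example. Treating MemAvailable as a rough stand-in for what the GPU could still allocate is my assumption; I haven’t checked how nvtop derives its VRAM figure on the Spark.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def summarize_gib(fields):
    """Convert MemTotal and MemAvailable (kB) into a small GiB summary."""
    kib_per_gib = 1024 * 1024
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return {
        "total_gib": total / kib_per_gib,
        "available_gib": available / kib_per_gib,
        "used_gib": (total - available) / kib_per_gib,
    }


def read_summary(path="/proc/meminfo"):
    """Read the live values from the running Linux system."""
    with open(path) as f:
        return summarize_gib(parse_meminfo(f.read()))
```

Calling read_summary() on the Spark and comparing the result with what htop and nvtop show should make it clearer which slice of the 128GB each tool is actually reporting.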