Hello, I’m currently using two DGX Spark systems connected via QSFP cables.
I have a question regarding memory usage. Each device is using approximately 15–20 GB of memory by default. Is there any method or configuration that would allow us to reduce this baseline memory consumption?
I’m fine with using a console-based environment without a GUI. My DGX Spark systems are used solely for LLM serving purposes.
I’m fairly certain I’ve seen this topic discussed on the forum before, but I’m having trouble finding the thread again, so I’m posting here. If there are any previous posts that cover the same topic, I’d really appreciate it if you could share the links.
On DGX OS (Ubuntu) the default systemd target is graphical.target (equivalent to the old-fashioned init 5), which starts the GUI that accounts for a fair share of the memory footprint you noticed.
You may want to run your Spark from the command line by switching to multi-user.target (equivalent to run level 3). You can make this the default permanently with `sudo systemctl set-default multi-user.target`.
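In full, the standard systemd commands for this (nothing DGX-specific) are:

```shell
# Make the console-only target the default (takes effect on next boot)
sudo systemctl set-default multi-user.target

# Or switch immediately without rebooting
sudo systemctl isolate multi-user.target

# Verify which target is currently the default
systemctl get-default

# Revert to the GUI later if you change your mind
sudo systemctl set-default graphical.target
```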
Yep, this is the way. To be fair, it won’t save much if you never log in to the desktop: 2-3 GB tops. Most of the memory utilization you see when nothing is running is filesystem cache anyway.
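You can see this with `free`: anything under buff/cache is reclaimable page cache, and the "available" column is the realistic figure for how much a new workload can actually get:

```shell
# "buff/cache" is reclaimable page cache; "available" is what a new
# process could realistically allocate
free -h

# If you want to see the true baseline, you can drop clean caches
# (root only; harmless, but filesystem access is briefly slower)
# sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```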
Also (and it may be fixed by now), when I tried this a few months ago I ran into a couple of issues:
The dashboard would stop working: not all of the components it needs are enabled under multi-user.target. You can enable them manually, though. Maybe it’s fixed now; I don’t know.
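If you hit the same dashboard issue, plain systemctl can show which units graphical.target pulls in that multi-user.target doesn’t; the dashboard’s actual unit name isn’t known here, so the enable line uses a placeholder:

```shell
# Compare the two targets' dependency trees to find the missing units
systemctl list-dependencies graphical.target  > /tmp/graphical.txt
systemctl list-dependencies multi-user.target > /tmp/multi-user.txt
diff /tmp/multi-user.txt /tmp/graphical.txt

# Then enable the relevant service(s) so they start under multi-user.target;
# <dashboard-unit> is a placeholder - substitute the real unit name
# sudo systemctl enable --now <dashboard-unit>.service
```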
For some reason, NCCL performance was worse under multi-user.target than under graphical.target. I tried to figure this out and gave up. My only relevant theory was that under multi-user.target both NCCL and the networking stack got pinned to the same CPU core, while in graphical mode they normally ended up on different cores.
Not sure why. Maybe the small but constant background activity in graphical mode was enough for the scheduler to spread them across cores, maybe something else. I hadn’t tried NCCL_IGNORE_CPU_AFFINITY back then; maybe it would help.
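If you want to check the core-pinning theory yourself, `ps` can show which core each process last ran on, and `taskset` shows or changes a process’s allowed-CPU mask (the pinning line uses a placeholder PID):

```shell
# PSR = the core each process last ran on; compare your serving/NCCL
# processes against the network-related kernel threads
ps -eo pid,psr,comm | head -20

# Show the current shell's allowed-CPU list (substitute your server's PID)
taskset -cp $$

# Pinning example: restrict a process to cores 4-7 (<pid> is a placeholder)
# sudo taskset -cp 4-7 <pid>

# NCCL_IGNORE_CPU_AFFINITY=1 makes NCCL ignore the affinity mask it
# inherits, which is one thing to try if pinning looks wrong
```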
Again, that was back in November, but it’s something to test if you go this route.