FabricManager will not run

I had no idea that renaming the unit would trigger a lockdown, and I spent 14 hours last Sunday recovering. I ended up installing the ISO again and had to do the install without drivers.

My experience with this unit has been… difficult, to be generous.

At this time I cannot get the FabricManager to stay alive, and therefore all my models load into CPU memory.

Update: The next line is a red herring. I am no longer trying to load FM; now I am trying to get an instance of vLLM running.

Does anyone know how to fix the FabricManager so the GPU can be loaded and invoked?

I have been trying to fix this thing by using Gemini to help me, and for the most part it is helping. However, it often sends me down rabbit holes, and the Fabric Manager may be one of them.

It also seems that Ollama does not run well on this platform.

I am a seasoned developer trying to learn more about AI and use it in my work. I have worked with a Strix Halo unit as well as the assorted Gaming GPUs and until now, even with the Strix Halo, it was possible to get a model loaded and running in GPU memory.

I have yet to accomplish that with this box. I may return it as I am growing more convinced the ecosystem for this unit is not even Alpha state.

Am I over-reacting? Will this run Ollama, or do I need to use vLLM?

I need the capabilities of this unit, and as long as it will work I have the technical background to deal with it, but it has to work and be reliable once it works. My last week shook my confidence in Nvidia regarding the state of this box and the support.

Hopefully someone can point me in the right direction.

I’m a little confused here. Do you mean NVIDIA Fabric Manager? That package is meant for connecting multiple GPUs through an NVSwitch, not something you would use the Spark for.
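
A quick way to check whether Fabric Manager could even apply is to count the GPUs the driver reports. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names are made up for illustration, and the "more than one GPU" rule is only a rough heuristic, since multi-GPU machines without NVSwitch do not need Fabric Manager either.

```python
import subprocess


def list_gpus(smi_output: str) -> list[str]:
    """Return one GPU name per non-empty line of nvidia-smi CSV output."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]


def query_gpu_names() -> list[str]:
    """Ask nvidia-smi for the names of the GPUs it can see."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return list_gpus(result.stdout)


def might_need_fabric_manager(gpu_names: list[str]) -> bool:
    # Heuristic (assumption): a single-GPU box has no NVSwitch fabric to manage.
    # More than one GPU is necessary but not sufficient for needing it.
    return len(gpu_names) > 1
```

Calling `might_need_fabric_manager(query_gpu_names())` on the unit itself should come back `False` if only one GPU is listed, which would be consistent with Fabric Manager being a dead end here.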

Exactly. While I have decades of experience in traditional computing, I needed to rely on Gemini to help me figure out how to resurrect this unit. It thought we needed Fabric Manager, so I wasted a lot of time in that rabbit hole until it finally had one of its classic “I’m sorry, we don’t need FM” moments.

At this point I am trying to load vLLM (again, because it took quite a while for Gemini to tell me that Ollama is not suitable for this hardware) and I am trying to get it up and running.

If they had just put a big red card in the box that said “DO NOT TRY TO RENAME THIS UNIT” I probably would be running LLMs now…

Out of curiosity, how did you ‘trigger a lockdown’ by renaming the spark? Both of my units have been renamed from their original hostnames, without incident:

FE model:

dgx-spark:~$ uname -snri
Linux dgx-spark 6.17.0-1014-nvidia aarch64 

HP ZGX Nano G1N:

zgx-spark:~$ uname -snri
Linux zgx-spark 6.17.0-1014-nvidia aarch64

You can easily run ollama on the Spark, just follow the playbook:

It is probably worth having a look at the Spark Build site (DGX Spark) and following some of the tutorials and walkthroughs there to get a feel for it.

Thanks. One problem I have is that the UI has never run, not even when I powered the box on for the first time, and I got the impression that it is not supposed to run. My experience is that this is an unfinished product; at least, the particular unit I got is.

I will give that a try if I can get the download installed via the CLI. My efforts to get the UI going did not go well so I gave up on that.

I see that I can install the WebUI from CLI, so much appreciated!

If you reimaged the Spark using NVidia’s recovery image and following the guide ( System Recovery — DGX Spark User Guide ), the Spark would be wiped and reset back to a ‘factory-new’ state.
Not sure what you meant by ‘installing without drivers’ in the first post, but the recovery/reimage process does run in text mode; once it finishes and the system reboots, everything would’ve been set up back to the factory install, including the login GUI, graphical desktop, etc.

One last thing; there is no distinction between ‘CPU memory’ and ‘GPU memory’; RAM is shared between the CPU and GPU like on the Strix Halo system.
It may be worth revisiting the User Guide ( DGX Spark User Guide — DGX Spark User Guide ) and get a refresh of the Spark’s specifications and capabilities, as well as troubleshooting steps for some of the more common issues.
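
Because the memory pool is shared, a loaded model should show up as ordinary system memory use, as far as I understand it. A minimal sketch for watching that, assuming the standard Linux `/proc/meminfo` layout (the function names are only illustrative):

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Map each /proc/meminfo field to its numeric value (kiB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def used_gib(fields: dict[str, int]) -> float:
    """Memory in use, in GiB, computed as MemTotal minus MemAvailable."""
    return (fields["MemTotal"] - fields["MemAvailable"]) / (1024 ** 2)


def read_used_gib(path: str = "/proc/meminfo") -> float:
    """Read the live figure from the running system."""
    with open(path) as f:
        return used_gib(parse_meminfo(f.read()))
```

Comparing `read_used_gib()` before and after loading a model gives a rough idea of how much of the shared pool the model occupies, regardless of whether a tool labels it CPU or GPU memory.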

Please just use https://sparkrun.dev/ and you’ll be up and running in just a few commands.

Regarding how I locked it down, all I know is that I tried to rename it using the standard means that Gemini provided. While I have decades of experience in coding and Windows, my Linux experience is far more limited, so I tend to rely on Gemini. I did Linux back in the early 2000s, but have not used it since. Only in the last few months have I gotten back into it due to AI.
I triggered some sort of security mode that made the drive read only. Since I am not an expert at Linux it took me all day to get it back, then the better part of the last two days to get an LLM loaded.

Thanks to you pointing me to the WebUI you helped me with the final confirmation, and I just watched the GPU hit 96% while processing a 72B Qwen model. That suggests to me I finally have it working.
Still waiting for an answer to my initial input, though… hmm.

And there you have it. That model loaded into CPU memory. I think the GPU usage reflected UI work, although I am not sure. This is what I have experienced every time I use Ollama: it loads the model into CPU memory and the CPU does the work while the GPU sits idle.
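
Rather than inferring placement from a utilization graph, it is possible to ask Ollama directly. The sketch below assumes an Ollama server on its default local port and uses its `/api/ps` endpoint, which reports `size` and `size_vram` for each loaded model; the helper names are hypothetical.

```python
import json
import urllib.request


def gpu_fraction(model: dict) -> float:
    """Fraction of a loaded model that Ollama reports as resident on the GPU."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def summarize(ps_response: dict) -> dict[str, float]:
    """Map each loaded model name to its GPU-resident fraction."""
    return {m["name"]: gpu_fraction(m) for m in ps_response.get("models", [])}


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Query the /api/ps endpoint of a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp)
```

With a model loaded, `summarize(fetch_ps())` returning a value near 1.0 would suggest the model is on the GPU, while 0.0 would suggest a CPU-only load. The `ollama ps` command shows similar information on the command line.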

Regarding the install, I chose the full install and it failed each time, so I eventually tried the install without drivers. That installed and gave me something of a UI, which I was able to turn off to get a command line and ssh going. I have been using it that way since.

My experience from turning this unit on has not been what I expected, and it seems that may be more valid than I thought. I get the impression it was supposed to come up to a full Linux desktop, but it did not do that when I initially booted it. Or at least, not the installer desktop.

Now that I ‘raggedy andy’ here, I recall that when I got it reinstalled I did get a more proper desktop, but at that time I figured I would use it via ssh and via my code, but that has not worked out due to the issue with GPU not doing the work.

I never did go back to the desktop that appeared when I got the image reinstalled. You may think that foolish but I have been focused on trying to get a model loaded, so just forgot about it. I had so much trouble even getting ssh working after reloading the OS due to repository conflicts I was happy to just be able to log in remotely.

If there is something useful in the interface please direct my attention to it. I will probably try to get it up and running again soon to verify.

Wow! I finally saw GPU utilization vary up to 96%, with more than 80 GB of GPU RAM used in Ollama.

Thanks for telling me it should work. Having that confidence made all the difference; I had to keep at it until it worked.

Much appreciated! Now I can properly evaluate this unit and see if it will meet my needs.

I highly recommend you go through the User Guide I linked in my previous post, especially since you mention your Linux experience is more limited.
Using the DGX Dashboard (DGX Dashboard — DGX Spark User Guide) will be the easiest method to keep your system up-to-date.

Then start going through the Spark Build Site and work through some of the demos to get a feel for the system. You can also look at the community projects listed on Spark Arena (https://spark-arena.com/), including sparkrun (https://sparkrun.dev) like @olbc suggested.

I appreciate your input. As I said earlier, it was instrumental in my eventual success.

I have installed many OSes, so this was far from my first.

What I am saying is that I did the install of the full driver set and it failed. As in did not complete.

I tried it again, it failed again, so I tried the option without drivers, and that succeeded.

Given that there are no choices of consequence I can make differently, I came to the conclusion that the OS did not seem fully baked, or something along those lines.

I have never seen an OS install fail like that, so I had no experience to draw from. Linux, Windows, they typically just install.

I am acutely aware of the architectural details of both machines. While very similar, there are indeed internal differences in how memory is connected and used.

Jim Gutterman

jim@x-gecko.com