Ostris' AI Toolkit on DGX Spark

Just letting everyone know, my PR for DGX OS support in AI Toolkit has been merged by Ostris:

It contains the instructions and changes needed to get it running on the Spark and other DGX OS devices. I'd consider the support 'initial': it works, but there are areas that could be improved. I'm hoping NVIDIA might be able to supply a DGX Spark to Ostris himself, as that would be the best way to ensure good support for the platform going forward. AI Toolkit is one of the most popular fine-tuners out there for image and video models, so it makes sense to make sure it is well supported on the DGX Spark; just something for the team at NVIDIA to think about.


Great news, I've been waiting for this.

I was able to get AI Toolkit to run on my DGX Spark using the instructions linked on the GitHub page. The only issue was a conflict between dgx_requirements.txt and requirements.txt: scipy==1.16.0 in the former versus scipy==1.12.0 in the latter. I deleted the entry in requirements.txt and everything installed.
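For anyone hitting the same conflict, the fix described above amounts to deleting the stale pin before installing. Here's a minimal sketch of that idea using throwaway files (the file contents are illustrative, not the project's real requirement lists):

```shell
# Reproduce a duplicate-pin situation in a scratch directory, then apply
# the fix: delete the scipy line from requirements.txt so only the pin
# in dgx_requirements.txt remains. Contents are illustrative only.
cd "$(mktemp -d)"
printf 'torch\nscipy==1.12.0\n' > requirements.txt
printf 'scipy==1.16.0\n'        > dgx_requirements.txt
sed -i '/^scipy==/d' requirements.txt        # drop the conflicting pin
cat requirements.txt dgx_requirements.txt    # merged view: one scipy pin
```

After the `sed`, a `pip install -r dgx_requirements.txt` in the real checkout no longer sees two competing scipy versions.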

Also, it is easier to install nodejs and npm via apt (`sudo apt install nodejs npm`).

For ZImage LoRA training, it runs at 5.3 s/it with 20 images. The dashboard reports 34.4 GB of RAM in use.

Thanks for your work on getting AI Toolkit to run on DGX Spark. Much appreciated.


Looks like the scipy entry was added to requirements.txt yesterday. I’ll have a chat to Ostris about it. We may need to introduce a shared base requirements file that’s included by the other requirements files.

Alternatively, we could move all requirements into dgx_requirements.txt so it doesn't depend on requirements.txt at all, but then both files have to be maintained: any new library added to requirements.txt would also have to be added to dgx_requirements.txt for DGX OS devices.

I’ll chat to Ostris, see which way he wants to go and then make the change.

It is unfortunately difficult to avoid this kind of problem, especially since Ostris doesn’t have access to a DGX Spark, so it’s impossible for him to validate that changes don’t break things on DGX devices. This is why I want NVIDIA to give Ostris a DGX Spark, so things like this don’t happen, and these devices get the support they deserve (which I can’t provide, I’m just doing what I can to keep it running).


The first training run was successful, taking 4 hours 33 minutes to complete 3000 training steps while generating one sample image every 250 steps.
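If the 5.3 s/it figure quoted earlier in the thread held for this run (an assumption on my part; it may have been a different configuration), the wall-clock time is roughly consistent, with the remainder plausibly going to the periodic sample images:

```shell
# Back-of-the-envelope consistency check using figures quoted in this thread.
awk 'BEGIN {
  steps      = 3000
  sec_per_it = 5.3                      # assumed from the earlier post
  reported   = 4*3600 + 33*60           # 4 h 33 min in seconds
  training   = steps * sec_per_it       # pure training time
  samples    = steps / 250              # 12 sample images
  printf "training time: %d s\n", training
  printf "remainder:     %d s (~%d s per sample image)\n",
         reported - training, (reported - training) / samples
}'
```

That leaves about 480 s unaccounted for, which works out to roughly 40 s per sample image if the assumption holds.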


I’ve created a PR to fix the dgx_requirements.txt issue. After speaking with Ostris, he preferred that we just separate the DGX requirements into their own file and maintain them independently. This also offers the benefit that we can use newer versions of libraries if anything is identified that works much better on the Spark.

I looked into your suggestion for NodeJS, but `apt install nodejs` seems to install NodeJS v18 rather than v24. I think it’s better for people to run the current v24 LTS release, so I left the instructions as-is.


AI Toolkit is working very well. I’ve created LoRAs with Flux1, ZImage, Illustrious, and Qwen 2512.

What’s the best way to upgrade to a new version as new models are released?

Generally, you should be able to just do a git pull. The Python code doesn’t need anything specific unless requirements.txt has been updated, in which case just pip install the new requirements. The command in the instructions for the Node-based UI will rebuild everything for you. In short, you shouldn’t have to do anything specific most of the time: just git pull, and if something breaks, check whether you need to update a requirement.


Would it be possible for you to post some comparisons with any other regular GPU you might have? I feel like the DGX Spark is practically built for this use-case, but I can’t find anyone posting GenAI video training benchmarks.

I’m especially curious about a direct comparison between a regular GPU (like a 5090 or 4090) with larger batch sizes versus more iterations and gradient accumulation. With all of that effective memory, the Spark should be very competitive with big GPUs at this task.

The only other GPU I can compare the DGX Spark to is a Gigabyte laptop with a 3070 Ti (8 GB VRAM). The DGX Spark is 2x to 3x faster with the AI and video models I have tried on both machines. The other advantage of the Spark is that I can make normal-sized videos; the 3070 Ti was limited to 512x768.

The biggest advantage of the Spark over other GPUs is I never really run out of memory. Even when I’m running gpt-oss:120b, I can still run all of the ComfyUI workflows I have with the exception of Flux 2, and there’s a known bug in ComfyUI related to the shared memory of the Spark that causes that problem.

By the way, git pull worked perfectly for updating AI-Toolkit the last time I tried it.

What you’re asking for is really difficult. It took me literally a couple of hours to find a configuration that would even train on a 5090; 32 GB is just not enough for any kind of serious training of video models. To train 109 frames, which is what my dataset is currently made for, I had to use the low-VRAM setting, switch to WAN 2.2 5B, and train at 512. As soon as I tried 768, I kept running out of VRAM on the 5090.

Long story short, in a like-for-like training of WAN 2.2 5B, I got the following for the training steps:
5090: 6.57s/it (for training)
DGX Spark: 21.27s/it (for training)

And the following in like-for-like sample generation:
5090: 1.65s/it (for sample image generation)
DGX Spark: 9.14s/it (for sample image generation)

I’ve never worked out why, but in AI Toolkit, sample generation has always been particularly slow on the Spark: in this case about 5.5x slower than the 5090. Training is closer to what I’d expect, at 3.24x the time it takes on the 5090. In general I tell people the DGX Spark is around 4 times slower than a 5090, but that’s just a rough number; depending on what you’re doing, it can obviously be faster or slower than that.

In terms of raw performance, the DGX Spark is rated at 1000 TOPS compared to the 5090’s 3352 TOPS, so compute-heavy tasks will probably land fairly close to those figures, as we saw above. The memory, however, is a lot slower (273 GB/s vs 1.79 TB/s), which at worst could mean 6.5x lower performance. I’ve personally never seen an example where the difference was that large, but it is technically possible, so workloads that specifically hit the memory hard, such as LLM inference, will be somewhat slower.
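The ratios above can be recomputed directly from the timings and specs quoted in this thread (the bandwidth figures are the published specs; 1.79 TB/s is taken as 1790 GB/s):

```shell
# Slowdown ratios from the timings and specs quoted above.
awk 'BEGIN {
  printf "training:  %.2fx slower\n",    21.27 / 6.57   # Spark vs 5090
  printf "sampling:  %.2fx slower\n",    9.14 / 1.65
  printf "compute:   %.2fx fewer TOPS\n", 3352 / 1000
  printf "bandwidth: %.2fx less\n",      1790 / 273
}'
```

Note how the observed training slowdown (~3.24x) sits close to the compute ratio (~3.35x), while the worst-case bandwidth ratio (~6.56x) bounds memory-bound workloads.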

Now in theory, there are options you can tweak on the Spark, but at that point we’re no longer doing a like-for-like comparison. I’m also not going to experiment further, as it’s really difficult to do any sort of video fine-tuning on the 5090; it just doesn’t have enough VRAM for it.

As for batch sizes, I usually train with batch size 1, as I haven’t found the increase in speed to be significant, and unless something has changed in the last few years, higher batch sizes usually reduce the quality of the training, so it only makes sense if it gives you enough of a performance boost to justify it. As a test, I switched to batch size 2 (something I normally don’t do), and doing twice as much per iteration took about twice as long: 40.83 s/it.

This is not a bad thing: it means the GPU is already well saturated at batch size 1. You’re getting near 100% out of the GPU, so doing twice as much per step just means each sample is processed at about half the speed.
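The saturation argument checks out numerically from the figures above: at batch size 2, the effective per-sample time barely moves.

```shell
# Per-sample time at batch 1 vs batch 2, from the timings above.
awk 'BEGIN {
  b1 = 21.27        # s/it at batch size 1 (1 sample per step)
  b2 = 40.83 / 2    # s per sample at batch size 2
  printf "batch 1: %.2f s/sample\n", b1
  printf "batch 2: %.2f s/sample\n", b2
  printf "gain:    %.1f%%\n", (1 - b2/b1) * 100
}'
```

A gain of only about 4% per sample confirms the GPU was already near full utilization at batch size 1.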

The 5090 is a really fast GPU, but the important thing about the DGX Spark is that it can run and train models that won’t even work on the best consumer GPUs in the first place. If I could only have one or the other, I’d pick the DGX Spark every time.