What you’re asking for is really difficult; it took me literally a couple of hours just to find a configuration that would train at all on a 5090. 32GB simply isn’t enough for any kind of serious video-model training. To train at 109 frames, which is what my dataset is currently built for, I had to enable the low-VRAM setting, switch to WAN 2.2 5B, and train at 512; as soon as I tried 768, I kept running out of VRAM on the 5090.
Long story short, in a like-for-like training run of WAN 2.2 5B, I got the following training speeds:
5090: 6.57s/it (for training)
DGX Spark: 21.27s/it (for training)
And the following in like-for-like sample generation:
5090: 1.65s/it (for sample image generation)
DGX Spark: 9.14s/it (for sample image generation)
I’ve never worked out why, but in AI Toolkit, sample generation has always been particularly slow on the Spark; in this case it’s about 5.5x slower than the 5090. Training is closer to what I’d expect, at 3.24x the 5090’s time per step. In general I tell people the DGX Spark is around 4 times slower than the 5090, but that’s just a rough number; depending on what you’re doing, it can obviously be better or worse than that. On paper, the DGX Spark delivers 1000 TOPS to the 5090’s 3352 TOPS, so compute-heavy tasks will probably land fairly close to that ratio, as we saw above. The memory, however, is a lot slower (273GB/s vs 1.79TB/s), which at worst could mean 6.5x lower performance. I’ve personally never seen an example where the difference was that large, but it is technically possible, so workloads that specifically hammer memory, such as LLM inference, will be somewhat slower.
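To make that arithmetic explicit, here’s a quick sketch (using only the measured s/it and spec-sheet numbers quoted above; nothing else is assumed) comparing the observed slowdowns against the theoretical compute and bandwidth ratios:

```python
# Measured seconds-per-iteration from the runs above
train_5090, train_spark = 6.57, 21.27
sample_5090, sample_spark = 1.65, 9.14

# Spec-sheet ratios: 5090 vs DGX Spark
compute_ratio = 3352 / 1000    # TOPS ratio, ~3.35x
bandwidth_ratio = 1790 / 273   # memory bandwidth ratio, ~6.56x

print(f"training slowdown: {train_spark / train_5090:.2f}x")    # -> 3.24x, near the compute ratio
print(f"sampling slowdown: {sample_spark / sample_5090:.2f}x")  # -> 5.54x, nearer the bandwidth ratio
print(f"compute ratio:     {compute_ratio:.2f}x")
print(f"bandwidth ratio:   {bandwidth_ratio:.2f}x")
```

The training slowdown sits almost exactly on the compute ratio, while sample generation drifts toward the bandwidth ratio, which is consistent with it being more memory-bound.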
Now, in theory there are options you could tweak on the Spark, but at that point we’re no longer doing a like-for-like comparison. I’m also not going to experiment further, since it’s really difficult to do any sort of video fine-tuning on the 5090 at all; it just doesn’t have enough VRAM for it.
As for batch sizes, I usually train with batch size 1, since I haven’t found the speedup from larger batches to be significant, and unless something has changed in the last few years, higher batch sizes usually reduce the quality of the training, so they only make sense if the performance boost justifies it. As a test, I switched to batch size 2 (something I normally don’t do), and doing twice as much work per iteration also took about twice as long: 40.83s/it.
This is not a bad thing: it means the GPU is already well saturated at batch size 1. You’re getting close to 100% out of the GPU as it is, so doubling the batch just doubles the step time rather than improving throughput.
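As a sanity check on those numbers, the per-sample cost barely moves between the two batch sizes (figures taken from the Spark runs above):

```python
# Per-sample time on the DGX Spark at each batch size
bs1_it = 21.27   # s/it at batch size 1 -> 1 sample per iteration
bs2_it = 40.83   # s/it at batch size 2 -> 2 samples per iteration

per_sample_bs1 = bs1_it / 1   # 21.27 s per sample
per_sample_bs2 = bs2_it / 2   # 20.415 s per sample

speedup = per_sample_bs1 / per_sample_bs2
print(f"per-sample speedup at batch 2: {speedup:.2f}x")  # -> 1.04x, i.e. essentially none
```

A ~4% gain per sample isn’t worth whatever quality trade-off the larger batch brings, which is why I stick with batch size 1.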
The 5090 is a really fast GPU, but the important thing about the DGX Spark is that it can run and train models that won’t even work on the best consumer GPUs in the first place. If I could only have one or the other, I’d pick the DGX Spark every time.