Image diffusion speeds

After finally getting my hands on a GB10 I ran some simple torch and diffusers Python code to see how quickly Z-Image-Turbo (mostly BF16) can generate images on this little magic box. Every run used 9 steps at 1024x1024 with the same prompt: “A dog chasing a stick thrown by an astronaut on the moon, with a lunar lander in the background, and the Earth on the horizon”. Results:

Configuration                            Time (s)
Default attention, BF16, no compile      12.1
Sage attention, BF16, no compile         13.2
Flash attention, BF16, no compile        12.9
Default attention, BF16, compile         8.1
Default attention, GGUF Q8_0, compile    9.1

Each time is the average of the last 5 runs, ignoring the first 2. Interestingly, it took 2 runs for torch’s lazy compile to finish its job, and those runs took about 30s each. For me, 8s is not bad at all.
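For anyone wanting to repeat the measurement, here is a minimal sketch of that warm-up-then-average scheme. It is not the code I actually ran: `benchmark`, `generate` and `sync` are names made up for this example, and the workload below is a stand-in so the snippet runs without a GPU.

```python
import statistics
import time
from typing import Callable, List, Optional


def benchmark(generate: Callable[[], object],
              warmup_runs: int = 2,
              timed_runs: int = 5,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Call `generate` repeatedly and average only the runs after warm-up.

    `sync` is an optional hook (hypothetical name) that is called before and
    after each run, e.g. to wait for pending GPU work before reading the clock.
    """
    durations: List[float] = []
    for _ in range(warmup_runs + timed_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        generate()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)

    timed = durations[warmup_runs:]
    return {
        "warmup": durations[:warmup_runs],  # slow runs, e.g. lazy compilation
        "timed": timed,
        "mean": statistics.mean(timed),
    }


# Stand-in workload: the first two calls are slower, like the compile runs.
_calls = {"n": 0}


def fake_generate() -> None:
    _calls["n"] += 1
    time.sleep(0.03 if _calls["n"] <= 2 else 0.01)


result = benchmark(fake_generate)
print(f"warm-up runs: {[round(d, 3) for d in result['warmup']]}")
print(f"mean of last {len(result['timed'])} runs: {result['mean']:.3f}s")
```

With a real pipeline, `generate` would wrap the image-generation call, and `sync` could be `torch.cuda.synchronize` so that asynchronous GPU work is included in the timing.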

I’m wondering if I can run in FP8. Nvidia’s TransformerEngine looks like it could be of use, but I haven’t tried installing it yet, and the code would be a little less simple. Anyway, I’ll post a few more results in this thread.

I’m now getting 7.2s per 1024x1024 image with Z-Image-Turbo (9 steps). I suppose the nearly one-second speedup is due to the latest updates, since nothing else has changed. Meanwhile, with SDXL 1.0 at 1024x1024, 30 steps and guidance 7.5, I’m getting 11.3s per image. Will probably try Qwen-Image and Flux.2-Klein next.

I tried Qwen Image 2512 with a 1024x1024 image and 50 steps. It took around 61s per image and uses a lot of memory (around 50%). I think I might try the FP8 version.

I also tried ERNIE-Image-Turbo with 8 steps at 1024x1024, and it takes around 11.2s per image. So a bit slower than Z-Image-Turbo (cf. 7.2s).
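Since the models use different step counts, a rough way to compare them is seconds per step. The sketch below just divides the totals reported in this thread by the number of steps. That is a simplification: each total also includes text encoding and VAE decoding, and the SDXL run uses guidance, so treat the output as approximate. The variable and function names are only for this example.

```python
# (model, seconds per 1024x1024 image, steps), as reported in this thread
results = [
    ("Z-Image-Turbo", 7.2, 9),
    ("SDXL 1.0", 11.3, 30),
    ("Qwen Image 2512", 61.0, 50),
    ("ERNIE-Image-Turbo", 11.2, 8),
]


def seconds_per_step(total_seconds: float, steps: int) -> float:
    """Average wall-clock time per step, including fixed per-image overhead."""
    return total_seconds / steps


per_step = {name: seconds_per_step(t, s) for name, t, s in results}

# Print the models from fastest to slowest per step.
for name, value in sorted(per_step.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {value:.2f} s/step")

# Speedup from torch compile on Z-Image-Turbo (default attention, BF16),
# using the 12.1s and 8.1s figures from the first post.
compile_speedup = 12.1 / 8.1
print(f"compile speedup: {compile_speedup:.2f}x")
```

On these numbers Z-Image-Turbo comes out at about 0.8s per step and ERNIE-Image-Turbo at about 1.4s per step, which matches the per-image gap above.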

Can image generation run faster on two Sparks, or is it not possible to split the work?

It is not possible with ComfyUI’s default sampling node.

When using ComfyUI on DGX Spark, refer to the issue below. It explains why the memory usage may be higher than expected:

https://github.com/Comfy-Org/ComfyUI/issues/10896