Is NVIDIA aware of the 3X perf boost for Stable Diffusion (SD) generation of single images at 512x512 resolution?
The docs for cuDNN v8.7 mention perf improvements, but I'm wondering whether the degree of improvement has gone unrealized for certain setups.
Some background: I have a 4090 in an i9-13900K system with 32GB DDR5-6400 CL32 memory.
- AUTOMATIC1111 SD was taking about 1.7+ seconds to generate a 512x512 image at 20 Euler_a steps, which it reported as about 13.5 it/s. The GPU was 100% busy.
- As an experiment I downloaded PyTorch 2.0.0 and didn't really see an improvement for inference. Then I downloaded the source, built my own locally, and got a huge speedup: 620 ms per image at 39.5 it/s. Nearly 3X! This was such a huge surprise that I had to debug why.
- I found that the PyTorch v2 nightly build bundled an older cuDNN, and that copy was found first in the library search path.
- A local build doesn't pull down any version of cuDNN, so my existing v8.7 install was used, and apparently the gain can be huge.
- Few people are aware that all they have to do is install cuDNN v8.7 and delete the bundled cuDNN libraries in the PyTorch venv.
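A quick way to confirm which cuDNN build PyTorch actually loaded is `torch.backends.cudnn.version()`, which returns an encoded integer. The small helper below decodes it; the encoding (major\*1000 + minor\*100 + patch) matches cuDNN's 8.x versioning scheme, but treat this as a sketch.

```python
def decode_cudnn_version(v):
    """Decode the integer returned by torch.backends.cudnn.version().

    cuDNN 8.x encodes versions as major*1000 + minor*100 + patch,
    so 8700 means 8.7.0.
    """
    major, rest = divmod(v, 1000)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"

# Live check in your venv (requires torch):
#   import torch
#   print(decode_cudnn_version(torch.backends.cudnn.version()))
```

If this prints an older version than the v8.7 you installed system-wide, PyTorch is still picking up its bundled copy.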
There may be a few reasons why this gain went unnoticed:
- I get 3X on my 4090, but people with older cards see less of a benefit, although they are happy with the increase. 50%, 150%; I've seen different numbers reported.
- Those using Windows are getting mixed results. Even those with a 4090 might only get 30 it/s, which is excellent compared with what they got before I gave my workaround. It is still unclear why only a few on Windows can reach the 39.5 it/s I see, although see the next point.
- I have found that the combination of a 4090, cuDNN v8.7, and a not-too-old PyTorch/CUDA is so fast that my i9-13900K at 5.8 GHz is just barely fast enough to keep the 4090 100% busy.
In fact it seems almost exactly fast enough: if I run on my 4.3 GHz E-cores, performance drops to 43/58ths of the speed on the P-cores, i.e. in proportion to the clock ratio. Also, even those on Linux with slower CPUs get proportionally slower perf from their 4090.
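The clock-ratio observation above can be sanity-checked with a little arithmetic. This sketch just encodes the proportionality, under my assumption that throughput is bound by the single CPU thread feeding the GPU, using my measured 39.5 it/s as the P-core baseline.

```python
def cpu_bound_its(baseline_its, baseline_ghz, target_ghz):
    """If the bottleneck is one CPU thread submitting GPU work,
    throughput should scale roughly with that core's clock speed."""
    return baseline_its * target_ghz / baseline_ghz

# P-cores at 5.8 GHz give 39.5 it/s; predicted E-core (4.3 GHz) rate:
print(round(cpu_bound_its(39.5, 5.8, 4.3), 1))  # 29.3, i.e. 43/58ths of 39.5
```

That predicted ~29.3 it/s is close to what I actually measure on the E-cores, which is why I believe the CPU is the limiter.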
With regards to cuDNN v8.7, is it possible you didn't notice this huge boost because you were testing on Windows, or on a slow CPU, or both? Or didn't specifically test the 4090, which appears to see more of an improvement than other cards (but you need a fast CPU to see it)?
Of course, different applications might see different degrees of improvement, but for Stable Diffusion the community has been quite interested in knowing how to fix their PyTorch to get the benefit I showed them. FYI, I convinced the PyTorch GitHub team to do a PR to fix this. They are doing it only for PyTorch 2.0, which isn't GA yet, so users of older versions need the manual fix.
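For those stuck on older PyTorch builds, the manual fix amounts to installing cuDNN v8.7 system-wide and then removing the cuDNN copies that pip bundled into the venv so they stop shadowing it. The dry-run sketch below only lists candidate files; the search locations (`torch/lib` and `nvidia/cudnn/lib` under site-packages) are assumptions based on common pip wheel layouts, so review the output before deleting anything.

```python
import glob
import os
import site

def find_bundled_cudnn(site_dirs=None):
    """List cuDNN libraries bundled inside a pip-installed PyTorch.

    These copies can shadow a system-wide cuDNN (e.g. v8.7); deleting
    them lets PyTorch fall back to the system install. The subpaths
    searched here are assumptions about typical wheel layouts.
    """
    if site_dirs is None:
        site_dirs = site.getsitepackages()
    patterns = [
        os.path.join(d, sub, "libcudnn*")
        for d in site_dirs
        for sub in ("torch/lib", "nvidia/cudnn/lib")
    ]
    hits = []
    for pattern in patterns:
        hits.extend(glob.glob(pattern))
    return sorted(hits)

if __name__ == "__main__":
    for lib in find_bundled_cudnn():
        print(lib)  # inspect, then delete these to expose the system cuDNN
```

Run it inside the activated venv; an empty result means nothing is shadowing your system install.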
PS: I just barely got a version of SD using TensorRT working on Monday and went from 39.5 it/s to 88.7 it/s! That may look great, but you have to realize that most users are amazed at the 39.5. So TensorRT will be huge in the SD community once it gets integrated. I don't even know yet whether I can get that number higher, but now that it is functional I'm going to try upgrading to CUDA 12 and PyTorch 2.0.0 and see what happens.