Does NVIDIA know about the 3X perf improvement cuDNN can provide?

Is NVIDIA aware of the 3X perf boost for Stable Diffusion (SD) when generating single images at 512x512 resolution?
The docs for cuDNN v8.7 mention perf improvements, but I'm wondering whether the degree of improvement has gone unnoticed for certain setups.
My setup: a 4090 in an i9-13900K system with 32 GB of DDR5-6400 CL32 memory.

  1. AUTOMATIC1111 SD was taking about 1.7+ seconds to generate a 512x512 image at 20 Euler_a steps, which is about 13.5 it/s. The GPU was 100% busy.
  2. As an experiment I downloaded PyTorch 2.0.0 and didn't really see an improvement for inference. Then I downloaded the source, built it locally myself, and got a huge speedup: 620 ms per image at 39.5 it/s. Nearly 3X! This was such a huge surprise that I had to debug why.
  3. I found that the PyTorch v2 nightly build bundled an older cuDNN, and that copy was found first in the library search path.
  4. A local build doesn't pull down any version of cuDNN, so my existing v8.7 install was used, and apparently the gain can be huge.
  5. Few people are aware that all they have to do is install cuDNN v8.7 and delete the bundled cuDNN libraries in the PyTorch venv (a quick way to verify which cuDNN is in use is sketched right below).
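For anyone who wants to confirm which cuDNN their PyTorch install is actually picking up before and after the swap, something like the following works. This is just a minimal sketch; the exact library path printed will depend on your venv layout, and the /proc trick is Linux-only.

```python
# Quick sanity check: which cuDNN is this PyTorch install actually using?
import torch

print("PyTorch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)

# Run a tiny convolution so the cuDNN library actually gets loaded.
if torch.cuda.is_available():
    conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
    with torch.no_grad():
        conv(torch.randn(1, 3, 64, 64, device="cuda"))
    torch.cuda.synchronize()

print("cuDNN version seen by PyTorch:", torch.backends.cudnn.version())

# Linux only: list the loaded cuDNN shared libraries to see whether they come
# from the venv (the bundled copy) or from a system-wide install such as /usr.
try:
    with open("/proc/self/maps") as maps:
        paths = {line.split()[-1] for line in maps if "cudnn" in line}
    for path in sorted(paths):
        print("loaded:", path)
except OSError:
    pass
```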

There may be a few reasons why this gain went unnoticed:

  1. I get 3X on my 4090, but people with older cards see less of a benefit, although they are happy with the increase; 50%, 150%, I've seen different numbers reported.
  2. Those using Windows are getting mixed results. Even those with a 4090 might only get 30 it/s, which is still excellent compared with what they got before my workaround. It is still unclear why only a few people on Windows can reach the 39.5 it/s I see, although see #3.
  3. I have found that the combination of a 4090, cuDNN v8.7, and a not-too-old PyTorch/CUDA is so fast that my i9-13900K at 5.8 GHz is only just fast enough to push the 4090 to 100% busy.
    In fact it seems almost exactly fast enough: if I run on my 4.3 GHz E-cores, performance drops to about 43/58ths of what I get on the P-cores, i.e. it scales with the CPU clock. Also, even those on Linux with slower CPUs get proportionally slower perf from their 4090s. (A sketch of that core-pinning experiment is below.)
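For reference, the E-core vs P-core comparison was done by pinning the process to specific cores. Below is a minimal Linux sketch of that kind of experiment; the core ID ranges are assumptions that depend on your CPU topology (check with lscpu), and run_inference is just a placeholder for whatever generation loop you are timing.

```python
# Pin this process to a chosen set of CPU cores, then time the workload.
# Core ID ranges below are assumptions for an i9-13900K-style layout
# (P-cores with hyperthreading first, E-cores after); verify with `lscpu`.
import os
import time

P_CORES = set(range(0, 16))    # hypothetical: 8 P-cores x 2 threads
E_CORES = set(range(16, 32))   # hypothetical: 16 E-cores

def run_inference():
    # Placeholder for the actual Stable Diffusion generation call being timed.
    time.sleep(0.6)

def timed_on(cores, label, iters=10):
    os.sched_setaffinity(0, cores)   # Linux-only; pins the current process
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    per_image = (time.perf_counter() - start) / iters
    print(f"{label}: {per_image * 1000:.0f} ms per image")

timed_on(P_CORES, "P-cores")
timed_on(E_CORES, "E-cores")
```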

With regard to cuDNN v8.7, is it possible you didn't notice this huge boost because you were testing on Windows, or on a slow CPU, or both? Or didn't specifically test the 4090, which appears to see more of an improvement than other cards (but you need a fast CPU to see it)?

Of course, different applications might see different degrees of improvement, but for Stable Diffusion the community has been quite interested in knowing how to fix their PyTorch installs to get the benefit I showed them. FYI, I convinced the PyTorch GitHub team to do a PR to fix this. They are doing this only for PyTorch 2.0, which isn't GA yet, so users of older versions need the manual fix.

PS: I only just got a version of SD using TensorRT working on Monday and went from 39.5 it/s to 88.7 it/s!!! That may look great, but keep in mind that most users are already amazed by the 39.5. So TensorRT will be huge in the SD community once it gets integrated. I don't even know yet whether I can push that number higher, but now that it is functional I'm going to try upgrading to CUDA 12 and PyTorch 2.0.0 and see what happens.

NOTE: With the older cuDNN you can still get the same “throughput” by using larger batch sizes and generating multiple images at the same time. However, serial generation of single images is important for things like video generation, although that is more of an “image-in, image-out” recursive process. If that shows a similar improvement once I get time to test it, then I'm surprised there wasn't more hype when the newer cuDNN came out.
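To make the latency-vs-throughput distinction concrete, here is a rough sketch of how one might compare single-image latency against batched throughput with the diffusers library. The model ID, step count, and batch sizes are illustrative assumptions; AUTOMATIC1111 users would measure the same thing through the web UI instead.

```python
# Compare per-image latency (batch of 1) vs. throughput (larger batches).
# Model ID, steps, and batch sizes are illustrative assumptions.
import time
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"
pipe(prompt, num_inference_steps=1)   # warm-up so timings exclude one-time setup

for batch in (1, 4, 8):
    torch.cuda.synchronize()
    start = time.perf_counter()
    pipe(prompt, num_inference_steps=20, num_images_per_prompt=batch)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"batch {batch}: {elapsed:.2f} s total, {elapsed / batch:.2f} s per image")
```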

I am not familiar with the workload, but I think it is quite possible that the superior scaling you are observing compared to other people's platforms is due, at least in part, to your top-of-the-line host platform.

An effect already seen years ago with some workloads was that as GPU performance kept increasing faster than that of other system components, host/device communication and CPU performance in the serial or mildly parallel portions of the workload running on the CPU became noticeable bottlenecks in CUDA-accelerated applications.

BTW, I did not know that the i9-13900K supports DDR5-6400. I thought it maxed out at DDR5-5600. Are you quoting the specs of the installed DRAM or the actual CPU memory controller settings? If the latter, does this use some sort of XMP memory profile?

I'll have to check my BIOS settings to see if this custom-built machine is actually running at 6400.
I thought it was more about whether the motherboard can handle the faster DRAM. As far as XMP goes, I think I may have gotten that, but I'd have to check. Hmmm, I just read that you are right and 5600 is the max without XMP. I hope the guys at the custom shop didn't sell me the faster RAM knowing it wasn't going to run at full speed. I'll definitely check this and whether XMP is enabled.

But yes, it appears that once you pair cuDNN v8.7 with a 4090, you then need a very fast processor to push it to the max for the specific use case of single images at 512x512. I've been busy and haven't yet tested 768x768 on my slower E-cores to see whether it matters there or not. But most images, in volume, are generated at 512x512 to find good candidates for upscaling to higher resolution. Image generation service providers get requests through their APIs and, of course, want the fastest possible generation to serve many users.

I was surprised the CPU perf mattered so much for what I thought was largely a GPU operation. Could this have anything to do with scheduling? I've heard Windows has the option to run the scheduling on the GPU or the CPU.

On the 4090 the perf difference from just swapping cuDNN library versions is so huge that I wish I knew what you guys did. Knowing that, I wonder what else I could leverage it for.

512x512 images are small, and any kernel(s) processing them will likely have very short execution times. Lots of very short-running kernels tend to limit the scalability of CUDA-accelerated applications because of kernel launch overhead. Launch overhead has a software and a hardware component. The hardware component is primarily a function of PCIe version and link width. The software component is primarily influenced by single-thread CPU performance (in turn mostly a function of CPU clock) and secondarily by system memory speed. Even well-optimized CUDA applications spend some time in largely serial host code, and this likewise benefits from a faster CPU.
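As a rough illustration of that launch-overhead effect (not the Stable Diffusion code path itself), a micro-benchmark sketch like the one below compares many tiny kernel launches against a few large ones doing the same total amount of elementwise work.

```python
# Micro-benchmark sketch: many tiny kernel launches vs. a few large ones.
# Illustrates launch overhead only; it is not the Stable Diffusion code path.
import time
import torch

assert torch.cuda.is_available()

def time_elementwise(num_launches, elems_per_launch):
    x = torch.randn(elems_per_launch, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_launches):
        x.mul_(1.0001)   # one small in-place elementwise kernel per iteration
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Same total number of element updates, split differently across launches.
small = time_elementwise(num_launches=10_000, elems_per_launch=1_000)
large = time_elementwise(num_launches=10, elems_per_launch=1_000_000)
print(f"10,000 tiny launches: {small * 1e3:.1f} ms")
print(f"10 large launches:    {large * 1e3:.1f} ms")
```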

On Windows, GPUs can operate with the default WDDM driver, which imposes higher overhead on the CUDA driver compared to the Linux driver model. With the WDDM driver Windows is in control of GPU memory allocation and many other tasks, and it is primarily designed for system robustness, trying to keep the GPU in a sound state for use by the Windows GUI.

If there is more than one GPU in a system, one GPU can be dedicated to servicing the GUI, while the other GPUs can use the TCC driver (this may not be supported for all GPU models), which makes Windows treat these GPUs as non-graphical devices that do not serve the GUI; these GPUs are therefore under the full control of NVIDIA's driver. That can have some positive effects on performance compared to the use of the WDDM driver.

The way to think about CPU requirements in a GPU-accelerated system is to consider the GPU responsible for the parallel portion of the workload, and the CPU responsible for the serial portion of the workload. If you use a top-of-the-line GPU like the RTX 4090, which accelerates the parallel portion a lot, such a GPU needs to be paired with a CPU that excels on the serial part of the workload. That means using a moderate number of CPU cores (typically, 4 CPU cores per GPU is sufficient) that run at very high frequency (I typically recommend >= 3.5 GHz). Also, high-bandwidth system memory is needed, especially if there are multiple GPUs in the system. One aspect frequently neglected is the size of the system memory. In a well balanced system, system memory size should be 2x to 4x the total amount of GPU memory.
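One way to see why the serial portion matters so much is a quick Amdahl's-law style calculation. The 10% serial fraction below is a made-up assumption for illustration, not a measurement of Stable Diffusion; the point is that the more the GPU shrinks the parallel part, the more the remaining serial CPU time dominates.

```python
# Back-of-the-envelope Amdahl's-law sketch: as the GPU speeds up the parallel
# part, total runtime becomes dominated by the serial (CPU-bound) fraction.
# The 10% serial fraction is an assumed illustration, not a measured value.
serial_fraction = 0.10
parallel_fraction = 1.0 - serial_fraction

for gpu_speedup in (1, 4, 16, 64):
    total = serial_fraction + parallel_fraction / gpu_speedup
    print(f"GPU speedup {gpu_speedup:3d}x -> overall speedup {1.0 / total:5.2f}x")
```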

The above are some general guidelines, and it is well possible that your use case neither needs large GPU memory nor large system memory, and instead benefits from the lower latency configuration of the system memory that is possible when installing just one DRAM stick per memory controller channel (I use such setups for my workstation configurations). At this time I would consider 32 GB of system memory the absolute bare minimum for a high-performance workstation, regardless of whether prospective workloads are CPU-centric or GPU-centric.

I do not work for NVIDIA (I did work for them in the past, 2003-2014), and NVIDIA traditionally is very tight-lipped about the internal workings of their products, beyond what is written up in public documentation including release notes and the occasional publication in a scientific journal.

I looked at some DDR5 benchmarks, and it seems that DDR5-6400 CL32 using the appropriate XMP profile provides about 10%-15% higher system memory performance compared to DDR5-5600 CL46 with the JEDEC profile, with both latency and throughput improved. The specific DDR5-6400 CL32 sticks used in the benchmark were G.Skill branded.

So that is a nice high-performance dual-channel DDR5 configuration you have there, roughly equivalent to a four-channel DDR4-3200 configuration, providing a bandwidth of about 100 GB/sec.
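For anyone checking the arithmetic behind that ~100 GB/sec figure, a quick back-of-the-envelope calculation of peak theoretical bandwidth (transfers per second times bus width times channel count, ignoring real-world efficiency losses) looks like this:

```python
# Peak theoretical memory bandwidth: transfers/sec x bus width x channels.
# DDR5-6400, 64-bit (8-byte) channel, dual channel:
ddr5 = 6400e6 * 8 * 2   # ~102.4 GB/s
# DDR4-3200, 64-bit channel, quad channel:
ddr4 = 3200e6 * 8 * 4   # ~102.4 GB/s
print(f"DDR5-6400 dual-channel: {ddr5 / 1e9:.1f} GB/s")
print(f"DDR4-3200 quad-channel: {ddr4 / 1e9:.1f} GB/s")
```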

So, I didn't get how to do this. Can you explain, please?