Originally published at: NVIDIA TensorRT Accelerates Stable Diffusion Nearly 2x Faster with 8-bit Post-Training Quantization | NVIDIA Technical Blog
In the dynamic realm of generative AI, diffusion models stand out as the most powerful architecture for generating high-quality images with text prompts. Models like Stable Diffusion have revolutionized creative applications. However, the inference process of diffusion models can be computationally intensive due to the iterative denoising steps required. This presents significant challenges for companies…
I just noticed that the quantization and export code in the blog can be run with TensorRT 9.3. When will a workable version be released?
Hi @qqsongzi,
A working version of INT8 quantization with DemoDiffusion is already available in the NVIDIA TensorRT repo: TensorRT/demo/Diffusion at release/9.3 · NVIDIA/TensorRT · GitHub.
Please note that the APIs in the above scripts can be slightly different from what's described in this blog post. The latest quantization APIs, with more performance optimizations, will be released in a few days; we'll share the wheel here, and you can use it with TensorRT 9.3.
We also encourage you to sign up for our session at GTC: Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT.
Hi, great article. I noticed that the latest (10.0) TensorRT diffusion demos don’t have the FP8 option. When will this be ready?
I could see that AMMO has FP8 quantization options, so I tried adapting the example for FP8. However, I ran into issues exporting the FP8-quantized PyTorch model to ONNX. So I figured I must either be missing something, or there are some updates still coming?
Hello!
FP8 SDXL will be generally available in the public TensorRT Github repo in a few weeks.
As of now, we can share it with you as early access via email or NVOnline.
I’ve sent you a private message and we can connect there.
Cheers,
Erin
After some study, I found that building the SD TensorRT engine is difficult. Could you please release the code from the article on GitHub? That would make it easier for people to reproduce the results. Thank you very much.
Hi @zhangp365, the INT8 quantization example is in the NVIDIA TensorRT repo, as mentioned above. Does this address your question, or are you asking for something else?
Hi Erin, could you please share the FP8 code with me as well? We would like to test it. Thanks.
We released an SDXL quantization example, including the FP8 solution, in the TensorRT Model Optimizer diffusers examples. Feel free to check TensorRT-Model-Optimizer/diffusers at main · NVIDIA/TensorRT-Model-Optimizer · GitHub
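For anyone landing here later, the overall flow in that example is: load the SDXL pipeline, quantize the UNet with a short calibration loop, then export and build the TensorRT engine. Below is a minimal sketch of the quantization step only, assuming the modelopt.torch.quantization API and its FP8_DEFAULT_CFG config; the calibration prompts and step count here are placeholders, and the linked repo has the full recipe (including the calibration settings and the ONNX export/engine build steps).

```python
# Minimal FP8 quantization sketch for the SDXL UNet.
# Assumes nvidia-modelopt's quantization API; the calibration prompts,
# step count, and config choice are placeholders -- see the linked
# TensorRT-Model-Optimizer/diffusers example for the full recipe.
import torch
from diffusers import DiffusionPipeline
import modelopt.torch.quantization as mtq

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

calib_prompts = ["a photo of an astronaut riding a horse on mars"]  # placeholder

def forward_loop(unet):
    # Calibration: run a few denoising passes so the quantizers observe
    # realistic activation ranges before scales are frozen.
    pipe.unet = unet
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=20)

# Quantize the UNet in place using the FP8 default config.
pipe.unet = mtq.quantize(pipe.unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```

After quantization, the UNet still needs to be exported to ONNX and built into a TensorRT engine to see the speedup; the repo's scripts cover that part.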
Hi @erinh @zhiyuc,
I am interested in this work and tried to reproduce it using your code. However, after following the steps below to build the TRT engine and run inference, the speedup I saw from TensorRT INT8 over FP16 wasn't that significant.
Do you have suggestions on how to get the expected inference speedup? I simply generated 10 images and took the average in Python code. If you could provide a code snippet to better reproduce the speedup, that would be great too.
Otherwise, if you could share the backbone.plan engine, I could give it a try as well. Thanks!
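For reference, a fairer timing loop for this kind of FP16 vs. INT8 comparison usually discards a few warm-up runs and synchronizes the GPU around the timed region, so one-time costs and asynchronous kernel launches don't skew the average. A minimal sketch (the pipeline object, prompt, and step count are assumptions, not the repo's benchmark script):

```python
# Minimal latency benchmark sketch for comparing pipeline variants
# (e.g. FP16 vs. INT8 engines). `pipe` is assumed to be a callable
# text-to-image pipeline; the prompt and step count are placeholders.
import time
import torch

prompt = "a photo of an astronaut riding a horse on mars"
num_warmup, num_runs = 3, 10

# Warm-up: excludes one-time costs (CUDA context init, allocator warm-up, caching).
for _ in range(num_warmup):
    pipe(prompt, num_inference_steps=30)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(num_runs):
    pipe(prompt, num_inference_steps=30)
torch.cuda.synchronize()  # wait for pending GPU work before stopping the clock

print(f"avg latency: {(time.perf_counter() - start) / num_runs * 1000:.1f} ms/image")
```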