Originally published at: NVIDIA TensorRT Accelerates Stable Diffusion Nearly 2x Faster with 8-bit Post-Training Quantization | NVIDIA Technical Blog
In the dynamic realm of generative AI, diffusion models stand out as the most powerful architecture for generating high-quality images with text prompts. Models like Stable Diffusion have revolutionized creative applications. However, the inference process of diffusion models can be computationally intensive due to the iterative denoising steps required. This presents significant challenges for companies…
I just noticed that the quantization and export code in the blog can be run with TensorRT 9.3. When will a workable version be released?
Hi @qqsongzi,
A working version of INT8 quantization with DemoDiffusion is already available in the NVIDIA TensorRT repo: TensorRT/demo/Diffusion at release/9.3 · NVIDIA/TensorRT · GitHub.
Please note that the APIs in the above scripts can be slightly different from what's described in this blog post. The latest quantization APIs, with more performance optimizations, will be released in a few days; we'll share the wheel here, and you can use it with TensorRT 9.3.
We also encourage you to sign up for our session at GTC: Optimize Generative AI inference with Quantization in TensorRT-LLM and TensorRT.
Hi, great article. I noticed that the latest (10.0) TensorRT diffusion demos don’t have the FP8 option. When will this be ready?
I could see that AMMO has FP8 quantization options, so I tried adapting the example for FP8. However, I ran into issues exporting the FP8-quantized PyTorch model to ONNX. So I figured I must either be missing something, or there are some updates still coming?
Hello!
FP8 SDXL will be generally available in the public TensorRT Github repo in a few weeks.
As of now, we can share it with you as early access via email or NVOnline.
I’ve sent you a private message and we can connect there.
Cheers,
Erin
After some study, I found that building the SD TensorRT engine is difficult. Could you please release the code from the article on GitHub? That would make it easier for people to reproduce the results. Thank you very much.
Hi @zhangp365, the INT8 quantization example is in the NVIDIA TensorRT repo, as mentioned above. Does this address your question, or are you asking for something else?
Hi Erin, could you please share the FP8 code with me as well? We would like to test it. Thanks.
We released an SDXL quantization example, including the FP8 solution, in the TensorRT Model Optimizer diffusers examples. Feel free to check TensorRT-Model-Optimizer/diffusers at main · NVIDIA/TensorRT-Model-Optimizer · GitHub
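For anyone landing here later, the overall flow in that example is: load the SDXL pipeline, quantize the UNet with a short calibration loop, then export and build the TensorRT engine. Below is a minimal sketch of the quantization step only, assuming the modelopt.torch.quantization API and its FP8_DEFAULT_CFG config; the calibration prompts and step count here are placeholders, and the linked repo has the full recipe (including the calibration settings and the ONNX export/engine build steps).

```python
# Minimal FP8 quantization sketch for the SDXL UNet.
# Assumes nvidia-modelopt's quantization API; the calibration prompts,
# step count, and config choice are placeholders -- see the linked
# TensorRT-Model-Optimizer/diffusers example for the full recipe.
import torch
from diffusers import DiffusionPipeline
import modelopt.torch.quantization as mtq

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

calib_prompts = ["a photo of an astronaut riding a horse on mars"]  # placeholder

def forward_loop(unet):
    # Calibration: run a few denoising passes so the quantizers observe
    # realistic activation ranges before scales are frozen.
    pipe.unet = unet
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=20)

# Quantize the UNet in place using the FP8 default config.
pipe.unet = mtq.quantize(pipe.unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```

After quantization, the UNet still needs to be exported to ONNX and built into a TensorRT engine to see the speedup; the repo's scripts cover that part.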
Hi @erinh @zhiyuc,
I am interested in this work and tried to reproduce it using your code. However, after following the steps below to build the TRT engine and run inference, the speedup I saw from TensorRT INT8 over FP16 wasn't that significant.
Do you have suggestions on how to get the expected inference speedup? I simply generated 10 images and took the average in Python code. If you could provide a code snippet to better reproduce the speedup, that would be great too.
Otherwise, if you could share the backbone.plan engine, I could give it a try as well. Thanks!
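For reference, a fairer timing loop for this kind of FP16 vs. INT8 comparison usually discards a few warm-up runs and synchronizes the GPU around the timed region, so one-time costs and asynchronous kernel launches don't skew the average. A minimal sketch (the pipeline object, prompt, and step count are assumptions, not the repo's benchmark script):

```python
# Minimal latency benchmark sketch for comparing pipeline variants
# (e.g. FP16 vs. INT8 engines). `pipe` is assumed to be a callable
# text-to-image pipeline; the prompt and step count are placeholders.
import time
import torch

prompt = "a photo of an astronaut riding a horse on mars"
num_warmup, num_runs = 3, 10

# Warm-up: excludes one-time costs (CUDA context init, allocator warm-up, caching).
for _ in range(num_warmup):
    pipe(prompt, num_inference_steps=30)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(num_runs):
    pipe(prompt, num_inference_steps=30)
torch.cuda.synchronize()  # wait for pending GPU work before stopping the clock

print(f"avg latency: {(time.perf_counter() - start) / num_runs * 1000:.1f} ms/image")
```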