Quantize model with pytorch-quantization

There are two issues when I quantize the first part of the BLIP-2 model using pytorch-quantization and run it with trtexec: first, the INT8 model consumes more GPU memory than the FP16 model; second, the INT8 model runs slower than FP16.

Are these issues caused by the Q/DQ operations? If so, which quantization tool should we use instead? I have attached the quantization code.

Environment Information:

  • PyTorch version: 2.2.1
  • TensorRT version: 8.6.0
  • Transformers version: 4.37.2
  • pytorch-quantization version: 2.2.1

quantize.zip (1.7 KB)

Hi @bob19, I have not used BLIP-2, only CLIP, but in my limited experience with INT8 quantization through TensorRT, it is normally TensorRT that does the quantizing, and you provide it a calibration file to use. Have you tried not pre-quantizing the model with PyTorch? Unless @NVES has any suggestions specific to TensorRT, with CLIP I just stick with FP16 for now, since it represents a small portion of the overall pipeline (the vision/language model, in my case).

The purpose of using PyTorch for model quantization is to generate an INT8 ONNX model, which can then be used to build engine files on different platforms with TensorRT.
Do you recommend using TensorRT's calibrator for quantization instead? Also, is FP16 recommended for the image encoder in VL models?

Thanks Bob. I am not super experienced myself with INT8 quantization in TensorRT, but my understanding is that you give it a normal ONNX model; it doesn't need to be INT8. As you have found, pre-inserted Q/DQ operators can actually make it harder for TensorRT to optimize the model than the Q/DQ placements TensorRT selects itself. You then give TensorRT a calibration file built over a representative dataset so that it can choose the quantization scales correctly. I think sometimes people do quantization-aware training (QAT) first in PyTorch, but I'm not personally familiar with doing that; the TAO Toolkit supports it, though.

Now, LLM quantization down to 4-bit/INT4 seems to be a different story, and other quantization methods like GPTQ and AWQ are used there. But for computing image embeddings with a vision encoder, people generally seem to stick with FP16, and the embeddings themselves stay in FP16. I had done some initial experiments with a lower-precision CLIP in a vision/language model, and it made the VLM perform worse, so I think those models may be more sensitive to quantization given the importance of the image features the embeddings represent. Perhaps there has since been research where the whole model/pipeline is made aware of quantization during training for these purposes.