What is the official suggestion for using weight-only quantization / SmoothQuant in TensorRT?


Weight-only quantization and SmoothQuant are widely used for LLMs and are supported by FasterTransformer and TRTLLM. Is there any way to use them directly in TensorRT? After all, TRTLLM is not supposed to be the solution for everything: it supports neither the ONNX parser nor older GPUs, and it requires a complex environment. I have noticed that TRT has an explicit quantization mode, so would setting all weights to kINT8 be a valid way to get weight-only quantization?
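For context, "weight-only quantization" here means storing weights in INT8 with a per-channel scale and dequantizing them on the fly, while activations stay in floating point. A minimal numeric sketch of that idea in plain Python (no TensorRT API involved; function names are illustrative only):

```python
# Hypothetical sketch: symmetric per-channel INT8 weight-only quantization.
# Each output channel (row) gets its own scale; zero-point is 0 (symmetric).

def quantize_per_channel(weights):
    """Quantize each row of a weight matrix to INT8.

    Returns (list of int8 rows, list of per-channel scales).
    """
    q_rows, scales = [], []
    for row in weights:
        amax = max(abs(w) for w in row) or 1.0   # avoid division by zero
        scale = amax / 127.0                     # map [-amax, amax] -> [-127, 127]
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate FP weights: w ~= q * scale (done on the fly at runtime)."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [2.0, 0.1, -2.0]]
    q, s = quantize_per_channel(w)
    w_hat = dequantize(q, s)
    max_err = max(abs(a - b)
                  for ra, rb in zip(w, w_hat)
                  for a, b in zip(ra, rb))
    print(max_err < 0.01)  # reconstruction error bounded by half a quantization step
```

The per-row error stays below half a quantization step (scale / 2), which is why weight-only INT8 is usually close to lossless for LLM weights.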


Hi @1055057679 ,
SmoothQuant support and related performance optimizations have been added to the latest TRT release and can be used via the ONNX path.
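For readers unfamiliar with what SmoothQuant actually does before the engine build: it applies a per-channel scale that migrates activation outliers into the weights, so both tensors become easier to quantize, while the layer's output is mathematically unchanged. A plain-Python sketch of that transform (independent of any TensorRT API; alpha = 0.5 as in the SmoothQuant paper, function names are illustrative):

```python
# Hypothetical sketch of the SmoothQuant smoothing transform:
#   Y = X @ W  ==  (X / s) @ (s * W)   with one scale s_j per input channel.

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    """s_j = max|X_j|**alpha / max|W_j|**(1 - alpha), per input channel j."""
    return [(a ** alpha) / (w ** (1 - alpha))
            for a, w in zip(act_absmax, w_absmax)]

def apply_smoothing(x_row, weights, scales):
    """Return the equivalent smoothed pair: X' = X / s, W' = s * W (row-wise).

    weights is laid out with one row per input channel, so scaling row j of W
    by s_j exactly cancels dividing activation channel j by s_j.
    """
    x_smoothed = [x / s for x, s in zip(x_row, scales)]
    w_smoothed = [[s * w for w in row] for s, row in zip(scales, weights)]
    return x_smoothed, w_smoothed

def matvec(x, weights):
    """y_k = sum_j x_j * W[j][k] -- reference matmul to check equivalence."""
    return [sum(x[j] * weights[j][k] for j in range(len(x)))
            for k in range(len(weights[0]))]
```

After smoothing, the activation's dynamic range shrinks (the outlier channel is divided by a large scale), which is what makes INT8 activation quantization viable; the weights absorb the difference.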

Hi, do you mean TRT 9.1? May I ask for documentation on its SmoothQuant support? Or is that expected to come out soon?

And I’m afraid TRT 9.1 doesn’t support the P100, does it?

I’m currently exploring the possibility of implementing weight-only quantization for my models. While I’ve come across information about the FasterTransformer and TRTLLM libraries supporting such quantization techniques, I’m inclined to pursue a solution directly within TensorRT due to compatibility concerns and the complexity of the environment required by TRTLLM.