What is the official suggestion for using weight-only quantization / SmoothQuant in TensorRT?


Weight-only quantization and SmoothQuant are widely used for LLMs and are supported by FasterTransformer and TRTLLM. Is there any way to use them directly in TensorRT? After all, TRTLLM is not supposed to be the solution for everything: it supports neither the ONNX parser nor older GPUs, and it requires a complex environment. I have noticed that TRT has an explicit quantization mode, so would setting all weights to kINT8 be a valid way to get weight-only quantization?
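For context, "weight-only quantization" here means storing weights in INT8 with a per-channel scale and dequantizing them on the fly, while activations stay in floating point. A minimal numeric sketch of that idea in plain Python (no TensorRT API involved; function names are illustrative only):

```python
# Hypothetical sketch: symmetric per-channel INT8 weight-only quantization.
# Each output channel (row) gets its own scale; zero-point is 0 (symmetric).

def quantize_per_channel(weights):
    """Quantize each row of a weight matrix to INT8.

    Returns (list of int8 rows, list of per-channel scales).
    """
    q_rows, scales = [], []
    for row in weights:
        amax = max(abs(w) for w in row) or 1.0   # avoid division by zero
        scale = amax / 127.0                     # map [-amax, amax] -> [-127, 127]
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate FP weights: w ~= q * scale (done on the fly at runtime)."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

if __name__ == "__main__":
    w = [[0.5, -1.0, 0.25], [2.0, 0.1, -2.0]]
    q, s = quantize_per_channel(w)
    w_hat = dequantize(q, s)
    max_err = max(abs(a - b)
                  for ra, rb in zip(w, w_hat)
                  for a, b in zip(ra, rb))
    print(max_err < 0.01)  # reconstruction error bounded by half a quantization step
```

The per-row error stays below half a quantization step (scale / 2), which is why weight-only INT8 is usually close to lossless for LLM weights.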


Hi @1055057679 ,
SmoothQuant support and related performance optimizations have been added to the latest TRT release and can be used via the ONNX path.
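For readers unfamiliar with what SmoothQuant actually does before the engine build: it applies a per-channel scale that migrates activation outliers into the weights, so both tensors become easier to quantize, while the layer's output is mathematically unchanged. A plain-Python sketch of that transform (independent of any TensorRT API; alpha = 0.5 as in the SmoothQuant paper, function names are illustrative):

```python
# Hypothetical sketch of the SmoothQuant smoothing transform:
#   Y = X @ W  ==  (X / s) @ (s * W)   with one scale s_j per input channel.

def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    """s_j = max|X_j|**alpha / max|W_j|**(1 - alpha), per input channel j."""
    return [(a ** alpha) / (w ** (1 - alpha))
            for a, w in zip(act_absmax, w_absmax)]

def apply_smoothing(x_row, weights, scales):
    """Return the equivalent smoothed pair: X' = X / s, W' = s * W (row-wise).

    weights is laid out with one row per input channel, so scaling row j of W
    by s_j exactly cancels dividing activation channel j by s_j.
    """
    x_smoothed = [x / s for x, s in zip(x_row, scales)]
    w_smoothed = [[s * w for w in row] for s, row in zip(scales, weights)]
    return x_smoothed, w_smoothed

def matvec(x, weights):
    """y_k = sum_j x_j * W[j][k] -- reference matmul to check equivalence."""
    return [sum(x[j] * weights[j][k] for j in range(len(x)))
            for k in range(len(weights[0]))]
```

After smoothing, the activation's dynamic range shrinks (the outlier channel is divided by a large scale), which is what makes INT8 activation quantization viable; the weights absorb the difference.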

Hi, do you mean TRT 9.1? May I ask for documentation on its SmoothQuant support? Or is that expected to come out soon?

And I’m afraid TRT 9.1 doesn’t support the P100, does it?

I’m currently exploring the possibility of implementing weight-only quantization for my models. While I’ve come across information about the FasterTransformer and TRTLLM libraries supporting such quantization techniques, I’m inclined to pursue a solution directly within TensorRT due to compatibility concerns and the complexity of the environment required by TRTLLM.