Hi,
Can anyone point me to an example of how to perform INT4/FP4 quantization? I am able to perform INT8 quantization so far.
Hi,
Which frameworks do you use?
For example, you can set the quantization to INT4 for an MLC use case like below:
https://elinux.org/Jetson/L4T/Jetson_AI_Stack#MLC
mlc_llm convert_weight models/Qwen3-30B-A3B-Instruct-2507/ \
  --quantization q4bf16_1 \
  --model-type qwen3_moe \
  --device cuda \
  --source-format huggingface-safetensor \
Thanks.
Hi,
I am using PyTorch models and TensorRT 10 for quantization. If we do INT4 quantization, how will the calibration cache be generated? My models are for OD/IS.
Hi,
Please try to use our TensorRT Model Optimizer to quantize the model:
Thanks.
Will it be supported on NVIDIA Thor? Also, is there any example of how to perform INT4 quantization? I could not find anything.
Hi,
Could you share what kind of model you want to quantize so we can find a corresponding one for you?
Do you want to apply quantization for an LLM or VLM?
Thanks.
Hi @AastaLLL
I am trying to run the YOLOv7 model (WongKinYiu/yolov7 on GitHub). I have run it with INT8 quantization and it works fine. I want to try INT4 and FP4 quantization.
Hi,
Usually, INT4/FP4 quantization is applied to LLMs.
For CNNs, it’s more common to use INT8 to preserve precision.
You can find the CNN quantization with TensorRT Model Optimizer below:
For int4, there are some examples in the link below:
Thanks.
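To illustrate the precision point above, here is a minimal sketch in plain Python that fake-quantizes (quantizes, then dequantizes) a set of random weights at INT8 and INT4 and compares the rounding error. It is not TensorRT or Model Optimizer code: the function names, the Gaussian weight distribution, and the block size of 32 are all assumptions chosen for illustration only.

```python
import random


def quantize_symmetric(values, num_bits):
    """Fake-quantize floats with a single symmetric scale (quantize, then dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8, 7 for INT4
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / qmax
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax - 1, min(qmax, q))  # clamp to the signed integer range
        out.append(q * scale)
    return out


def quantize_blockwise(values, num_bits, block_size):
    """Same as above, but each block of `block_size` values gets its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_symmetric(values[start:start + block_size], num_bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights: 4096 samples from a zero-mean Gaussian (an assumption).
random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(4096)]

err_int8 = mean_abs_error(weights, quantize_symmetric(weights, 8))
err_int4 = mean_abs_error(weights, quantize_symmetric(weights, 4))
err_int4_block = mean_abs_error(weights, quantize_blockwise(weights, 4, 32))

print(f"INT8, one scale:        {err_int8:.6f}")
print(f"INT4, one scale:        {err_int4:.6f}")
print(f"INT4, block size of 32: {err_int4_block:.6f}")
```

With only 16 levels, INT4 rounds much more coarsely than INT8, and using one scale per small block recovers part of that loss. Whether the remaining error is acceptable for a detection or segmentation model has to be checked on your own validation data.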