Hi,
I am using the NVIDIA Model Optimizer ( https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html ) to quantize the official YOLOv7 object detection model with the following configs:
AVAILABLE_CONFIGS = {
    "int8_default": mtq.INT8_DEFAULT_CFG,
    "int8_smoothquant": mtq.INT8_SMOOTHQUANT_CFG,
    "fp8_default": mtq.FP8_DEFAULT_CFG,    # for H100 and newer GPUs
    "w4a8_awq": mtq.W4A8_AWQ_BETA_CFG,     # 4-bit weights, 8-bit activations
}
Conversion succeeds and accuracy looks pretty good (though I am not sure whether quantization is actually taking effect). The problem is inference: the w4a8 (int4-weight) model runs very slowly.
Here is my code:

# calibrate with forward_loop and insert quantizers into the model
model_q = mtq.quantize(model_c, config, forward_loop)

# save only the state_dict of the quantized model
checkpoint = {
    'state_dict': model_q.state_dict(),
}
model_name = 'weights/yolov7_' + config_name + '.pth'
torch.save(checkpoint, model_name)
How do I convert the w4a8 (int4) model to a TensorRT engine so that it runs faster?
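For context, this is the export path I was planning to try: export the quantized model to ONNX (via torch.onnx.export on model_q, so the Q/DQ nodes end up in the graph), then build an engine with trtexec. This is just a sketch; the file names are mine and I have not confirmed these trtexec flags work for 4-bit weights:

```shell
# Build a TensorRT engine from the ONNX export of the quantized model.
# --stronglyTyped keeps the precisions encoded by the Q/DQ nodes in the
# ONNX graph instead of letting the builder re-choose them (my assumption
# from the TensorRT docs, not verified for w4a8).
trtexec --onnx=yolov7_w4a8.onnx \
        --saveEngine=yolov7_w4a8.engine \
        --stronglyTyped
```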
Also, what does the following output mean?
================================================================================
Testing quantization config: w4a8_awq
Config details: {'quant_cfg': {'*weight_quantizer': [{'num_bits': 4, 'block_sizes': {-1: 128, 'type': 'static'}, 'enable': True}, {'num_bits': (4, 3), 'axis': None, 'enable': True}], '*input_quantizer': {'num_bits': (4, 3), 'axis': None, 'enable': True}, 'nn.BatchNorm1d': {'*': {'enable': False}}, 'nn.BatchNorm2d': {'*': {'enable': False}}, 'nn.BatchNorm3d': {'*': {'enable': False}}, 'nn.LeakyReLU': {'*': {'enable': False}}, '*lm_head*': {'enable': False}, '*proj_out.*': {'enable': False}, '*block_sparse_moe.gate*': {'enable': False}, '*router*': {'enable': False}, '*mlp.gate.*': {'enable': False}, '*mlp.shared_expert_gate.*': {'enable': False}, '*output_layer*': {'enable': False}, 'output.*': {'enable': False}, 'default': {'enable': False}}, 'algorithm': 'awq_lite'}
================================================================================
Thanks