Description
Converting an fp16 ViT model from ONNX to TensorRT with batch size > 1 results in garbage inference output. The conversion succeeds, but the output is always roughly the same arbitrary values, regardless of the input.
For example, the broken converted model's output for 4 images is:
[[ 2.671875 -1.2802734 -0.42578125]]
[[ 2.6484375 -1.2753906 -0.43652344]]
[[ 2.6328125 -1.2939453 -0.40966797]]
[[ 2.6425781 -1.2939453 -0.42871094]]
The correct model output should be:
[[-0.18860717 3.2261243 0.13521218]]
[[-0.9554258 3.263251 0.7200781]]
[[-0.5815705 3.6151268 -0.17602095]]
[[-1.1649361 3.341591 -0.16794899]]
This issue only occurs:
- if the model is FP16, and
- if a dynamic batch size is applied when converting to TensorRT. This applies both to trtexec conversion and to the tritonserver TensorRT gpu_execution_accelerator, with warmup batch_size>1 or a first request with batch_size>1. A warmup with batch_size=1 and a single image per request runs perfectly fine.
The same ONNX model runs fine in a local inference script or deployed with tritonserver, regardless of batch size and precision. I am also trying the same experiments with a yolov7 segmentation model, and there both the fp16 and fp32 models work properly with dynamic batch size.
Any idea why?
Any help would be appreciated. I want to deploy my model in half precision because the speed difference is huge:
- onnx fp32 with tensorrt optimizations: bs=1, 55 fps
- onnx fp16 without tensorrt optimizations: bs=1, 108 fps
- onnx fp16 with tensorrt optimizations: bs=1, 140 fps
Environment
TensorRT Version: 8.5.2.2
GPU Type: 2080ti, 3080
Nvidia Driver Version: 525.89.02
CUDA Version: 12.0
CUDNN Version: 8.7.0
Operating System + Version: Fedora Linux 37.20230223.0 (Silverblue)
Python Version (if applicable): 3.8.10
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 1.13.0
Baremetal or Container (if container which image + tag): nvcr.io/nvidia/tensorrt:23.01-py3, nvcr.io/nvidia/tritonserver:23.01-py3
Relevant Files
- GitHub - huggingface/pytorch-image-models
- GitHub - WongKinYiu/yolov7
Steps To Reproduce
- I have a ViT model that I converted to ONNX with:
import torch
from timm import create_model

torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

# load model
model = create_model(
    "vit_base_patch16_224",
    num_classes=3,
    in_chans=3,
    pretrained=True,
)
model.cuda()
model.eval()
model.half()

x = torch.randn(1, 3, 224, 224, requires_grad=True).cuda().half()
torch_out = model(x)

torch.onnx.export(
    model,                             # model being run
    x,                                 # model input (or a tuple for multiple inputs)
    "vit_base_patch16_224_half.onnx",  # where to save the model (can be a file or file-like object)
    export_params=True,                # store the trained parameter weights inside the model file
    opset_version=10,                  # the ONNX version to export the model to
    do_constant_folding=True,          # whether to execute constant folding for optimization
    input_names=['input'],             # the model's input names
    output_names=['output'],           # the model's output names
    dynamic_axes={'input': {0: 'batch_size'},    # variable length axes
                  'output': {0: 'batch_size'}},
)
I'm using a custom finetuned model, but the same applies to the pre-trained model provided by pytorch-image-models.
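To sanity-check the export itself, the ONNX output can be compared against the PyTorch output with onnxruntime (a minimal sketch, assuming onnxruntime-gpu is installed and this runs right after the export above):

import numpy as np
import onnxruntime as ort

# Run the exported model and compare against the PyTorch output from above.
sess = ort.InferenceSession(
    "vit_base_patch16_224_half.onnx",
    providers=["CUDAExecutionProvider"],
)
ort_out = sess.run(["output"], {"input": x.detach().cpu().numpy()})[0]

# fp16 needs loose tolerances
np.testing.assert_allclose(
    torch_out.detach().cpu().numpy(), ort_out, rtol=1e-2, atol=1e-2
)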
- Convert it to TensorRT with
trtexec --onnx=vit_base_patch16_224_half.onnx --minShapes=input:1x3x224x224 --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224 --fp16 --saveEngine=vit_base_patch16_224_half.trt
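As a quick standalone check, the built engine can also be exercised at batch size 4 directly with trtexec; --dumpOutput prints the raw output values:

trtexec --loadEngine=vit_base_patch16_224_half.trt --shapes=input:4x3x224x224 --dumpOutput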
When I use this model, both locally with an inference script that loads the TensorRT engine and deployed with tritonserver, it always gives outputs like these for every image (shown here for 4 images):
[[ 2.671875 -1.2802734 -0.42578125]]
[[ 2.6484375 -1.2753906 -0.43652344]]
[[ 2.6328125 -1.2939453 -0.40966797]]
[[ 2.6425781 -1.2939453 -0.42871094]]
The correct model output should be:
[[-0.18860717 3.2261243 0.13521218]]
[[-0.9554258 3.263251 0.7200781]]
[[-0.5815705 3.6151268 -0.17602095]]
[[-1.1649361 3.341591 -0.16794899]]
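For reference, the local inference script follows roughly this pattern (a minimal sketch; the pycuda usage, the random input, and the binding order input=0 / output=1 are assumptions):

import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda

BATCH = 4
logger = trt.Logger(trt.Logger.WARNING)
with open("vit_base_patch16_224_half.trt", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
context.set_binding_shape(0, (BATCH, 3, 224, 224))  # resolve the dynamic batch dim

h_input = np.random.randn(BATCH, 3, 224, 224).astype(np.float16)
h_output = np.empty((BATCH, 3), dtype=np.float16)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

cuda.memcpy_htod(d_input, h_input)
context.execute_v2([int(d_input), int(d_output)])
cuda.memcpy_dtoh(h_output, d_output)
print(h_output)  # with the broken engine, all 4 rows come out nearly identical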
- I also deployed the exact same model with tritonserver, with and without the gpu_execution_accelerator. Here is the tritonserver config:
name: "vitaction"
platform: "onnxruntime_onnx"
default_model_filename: "vit_base_patch16_224_half.onnx"
max_batch_size : 32
input [
{
name: "input"
data_type: TYPE_FP16
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "output"
data_type: TYPE_FP16
dims: [ 3 ]
}
]
optimization { execution_accelerators {
gpu_execution_accelerator : [ {
name : "tensorrt"
parameters { key: "precision_mode" value: "FP16" }
parameters { key: "max_workspace_size_bytes" value: "1073741824" }
parameters { key: "trt_engine_cache_enable" value: "true" }
parameters { key: "trt_engine_cache_path" value: "/var/cache/trt_cache" }
}
]}
}
model_warmup [{
name: "warmup"
batch_size: 1 # I'm changing this
inputs: [{
key: "input"
value: {
data_type: TYPE_FP16
dims: [3, 224, 224]
zero_data: true
}
}]
}]
If I change the batch_size in model_warmup to anything above 1, or send a request with more than one image, the inference output is the same broken values shown above. If I convert the model with batch_size=1 in trtexec, or warm up with batch_size=1, the model generates proper outputs.
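For completeness, a request with batch_size>1 that reproduces the broken outputs can be sent like this (a sketch; tritonclient over HTTP and the local server address are assumptions):

import numpy as np
import tritonclient.http as httpclient

# One request carrying 4 images, i.e. batch_size=4 on the server side.
client = httpclient.InferenceServerClient(url="localhost:8000")
batch = np.random.randn(4, 3, 224, 224).astype(np.float16)

inp = httpclient.InferInput("input", list(batch.shape), "FP16")
inp.set_data_from_numpy(batch)
result = client.infer("vitaction", inputs=[inp])
print(result.as_numpy("output"))  # rows come back nearly identical with fp16 + TRT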
This issue only occurs if the model is half precision and the batch size is bigger than 1. If I use the exact same script to convert an fp32 model, with every config/command identical except the fp16 flags, the model produces correct inference output regardless of the batch size.
I'm sure this issue is not related to the ONNX conversion step, because I can use the ONNX model in a local script, and I can deploy it with tritonserver using the exact same config minus gpu_execution_accelerator and model_warmup. Both the local runs and the tritonserver responses are perfectly fine.
I also tried another model, yolov7. The problem doesn't occur for the yolov7 model in either precision.