P6000: TensorRT too slow and the serialized FP16 engine size is not as expected

Description

I am trying to run a segmentation ONNX model with TensorRT.
Input shape: 1x3x1600x832; output shape: 1x1x1600x832.
First, I tested on Ubuntu with an A100-40G, and the results seem good. The inference time is around 40~50 ms, and when I set the FP16 flag, the FP16 TRT model is accelerated to around 25 ms. The serialized FP16 engine is nearly half the size of the FP32 one.

But when I moved the same project to Windows 10 with a P6000, some strange things occurred.
The inference time becomes ~400 ms for the float model.
And when I set the FP16 flag, the model shows no speedup (still ~400 ms), and the file size of the serialized TRT engine is still close to the FP32 one.
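
For reference, this is roughly how the FP16 flag is set during the engine build. A minimal sketch of the standard ONNX-parser build path, not the exact code of this project; `model.onnx`, `use_fp16`, and the logger are placeholders:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdint>
#include <iostream>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} logger;

bool buildEngine(bool use_fp16) {
    auto builder = nvinfer1::createInferBuilder(logger);
    const auto flags = 1U << static_cast<uint32_t>(
        nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = builder->createNetworkV2(flags);
    auto parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile("model.onnx",
        static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = builder->createBuilderConfig();
    if (use_fp16 && builder->platformHasFastFp16()) {
        // kFP16 only *allows* half-precision kernels; TensorRT still keeps
        // a layer in FP32 if the FP32 kernel is faster on the target GPU.
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
    }
    auto serialized = builder->buildSerializedNetwork(*network, *config);
    // ... write serialized->data(), serialized->size() to disk ...
    return serialized != nullptr;
}
```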

Environment

TensorRT Version: 8.4.1
GPU Type: A100 (Ubuntu) / P6000 (Windows 10)
Nvidia Driver Version: 522.25
CUDA Version: 11.6
CUDNN Version: 8.2.1 / 8.4.1 / 8.5.1 (we tried all three versions; the results were the same)
Operating System + Version: Ubuntu20.04 / Windows 10
Python Version (if applicable): 3.8
TensorFlow Version (if applicable): -
PyTorch Version (if applicable): 1.12
Baremetal or Container (if container which image + tag): -

Relevant Files

Steps To Reproduce

Some logs on Windows 10:
[InferenceHelper][117] Use TensorRT
[04/04/2023-18:26:31] [I] [TRT] [MemUsageChange] Init CUDA: CPU +293, GPU +0, now: CPU 19005, GPU 1156 (MiB)
[04/04/2023-18:26:38] [I] [TRT] Loaded engine size: 233 MiB
[04/04/2023-18:26:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +321, GPU +112, now: CPU 19647, GPU 1502 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +269, GPU +82, now: CPU 19916, GPU 1584 (MiB)
[04/04/2023-18:26:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.1
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +254, now: CPU 0, GPU 254 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 19678, GPU 1598 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 19678, GPU 1606 (MiB)
[04/04/2023-18:26:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.1
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +743, now: CPU 0, GPU 997 (MiB)
[InferenceHelperTensorRt][359] num_of_in_out = 2
[InferenceHelperTensorRt][362] tensor[0]->name: input
[InferenceHelperTensorRt][363] is input = 1
[InferenceHelperTensorRt][367] dims.d[0] = 1
[InferenceHelperTensorRt][367] dims.d[1] = 3
[InferenceHelperTensorRt][367] dims.d[2] = 1600
[InferenceHelperTensorRt][367] dims.d[3] = 832
[InferenceHelperTensorRt][371] data_type = 0
[InferenceHelperTensorRt][362] tensor[1]->name: output
[InferenceHelperTensorRt][363] is input = 0
[InferenceHelperTensorRt][367] dims.d[0] = 1
[InferenceHelperTensorRt][367] dims.d[1] = 1
[InferenceHelperTensorRt][367] dims.d[2] = 1600
[InferenceHelperTensorRt][367] dims.d[3] = 832
[InferenceHelperTensorRt][371] data_type = 0
[InferenceHelperTensorRt][456] 3
[SegmentationEngine][148] 832 1600 3
[SegmentationEngine][167] here[InferenceHelperTensorRt][329] process
[InferenceHelperTensorRt][333] 5324800
cudaMemcpyAsync cost 124.891 [msec]
[InferenceHelperTensorRt][340] process_2
[InferenceHelperTensorRt][345] process_3
[InferenceHelperTensorRt][351] process_4
[SegmentationEngine][173] thereTotal: 1024.149 [msec]
Capture: 12.140 [msec]
Image processing: 833.516 [msec]
Pre processing: 10.712 [msec]
Inference: 682.913 [msec]
Post processing: 39.013 [msec]
=== Finished 0 frame ===

[SegmentationEngine][148] 832 1600 3
[SegmentationEngine][167] here[InferenceHelperTensorRt][329] process
[InferenceHelperTensorRt][333] 5324800
cudaMemcpyAsync cost 324.339 [msec]
[InferenceHelperTensorRt][340] process_2
[InferenceHelperTensorRt][345] process_3
[InferenceHelperTensorRt][351] process_4
[SegmentationEngine][173] thereTotal: 495.360 [msec]
Capture: 11.482 [msec]
Image processing: 479.911 [msec]
Pre processing: 24.809 [msec]
Inference: 341.939 [msec]
Post processing: 33.544 [msec]
=== Finished 1 frame ===

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

Please share the model, script, profiler, and performance output (if not already shared) so that we can help you better.

Alternatively, you can try running your model with the trtexec command, as shown below.
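
For example, something like the following builds and profiles an FP16 engine (model path and engine name are placeholders; --dumpProfile prints per-layer timings):

```
trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.plan --verbose --dumpProfile
```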

While measuring model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead.
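
For instance, on the C++ side the inference-only latency can be isolated with CUDA events around the enqueue call. A sketch, assuming `context`, `bindings`, and `stream` come from your existing setup:

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Measure only the network execution with CUDA events, so the large
// cudaMemcpyAsync pre/post copies seen in the log are excluded.
float timedInferenceMs(nvinfer1::IExecutionContext* context,
                       void* const* bindings, cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    context->enqueueV2(bindings, stream, nullptr);  // inference only
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```
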
Please refer to the below links for more details:

Thanks!