Description
I am trying to run a segmentation ONNX model with TensorRT.
The input shape: 1, 3, 1600, 832; the output shape: 1, 1, 1600, 832.
First, I tested on Ubuntu with an A100-40G, and the results look good: inference takes around 40~50 ms. When I set the fp16 flag, the fp16-trt-model is accelerated to around 25 ms, nearly half the time of the fp32-trt-model.
But when I moved the same project to Windows 10 with a P6000, some strange things occurred:
The inference time becomes ~400 ms for the float model.
When I set the fp16 flag, the model gets no speedup (still ~400 ms), and the serialized trt engine on disk is still close to the fp32 model in size.
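For reference, the engines are built in the usual way; the commands below are a sketch of the equivalent trtexec invocations (`model.onnx` and the output names are placeholders, not our actual paths). Running the FP16 build with `--verbose` on the P6000 should show whether any layers actually run in FP16:

```shell
# Build FP32 and FP16 engines from the same ONNX model
# (model.onnx is a placeholder for the segmentation model)
trtexec --onnx=model.onnx --saveEngine=model_fp32.trt
trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.trt --verbose
```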
Environment
**TensorRT Version**: 8.4.1
GPU Type: A100 (Ubuntu) / P6000 (Windows)
Nvidia Driver Version: 522.25
CUDA Version: 11.6
CUDNN Version: 8.2.1 / 8.4.1 / 8.5.1 (we tried all three versions; the results are the same)
Operating System + Version: Ubuntu 20.04 / Windows 10
Python Version (if applicable): 3.8
TensorFlow Version (if applicable): -
PyTorch Version (if applicable): 1.12
Baremetal or Container (if container which image + tag): -
Relevant Files
Steps To Reproduce
Some logs on Windows 10:
```
[InferenceHelper][117] Use TensorRT
[04/04/2023-18:26:31] [I] [TRT] [MemUsageChange] Init CUDA: CPU +293, GPU +0, now: CPU 19005, GPU 1156 (MiB)
[04/04/2023-18:26:38] [I] [TRT] Loaded engine size: 233 MiB
[04/04/2023-18:26:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +321, GPU +112, now: CPU 19647, GPU 1502 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +269, GPU +82, now: CPU 19916, GPU 1584 (MiB)
[04/04/2023-18:26:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.1
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +254, now: CPU 0, GPU 254 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 19678, GPU 1598 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 19678, GPU 1606 (MiB)
[04/04/2023-18:26:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.1
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +743, now: CPU 0, GPU 997 (MiB)
[InferenceHelperTensorRt][359] num_of_in_out = 2
[InferenceHelperTensorRt][362] tensor[0]->name: input
[InferenceHelperTensorRt][363] is input = 1
[InferenceHelperTensorRt][367] dims.d[0] = 1
[InferenceHelperTensorRt][367] dims.d[1] = 3
[InferenceHelperTensorRt][367] dims.d[2] = 1600
[InferenceHelperTensorRt][367] dims.d[3] = 832
[InferenceHelperTensorRt][371] data_type = 0
[InferenceHelperTensorRt][362] tensor[1]->name: output
[InferenceHelperTensorRt][363] is input = 0
[InferenceHelperTensorRt][367] dims.d[0] = 1
[InferenceHelperTensorRt][367] dims.d[1] = 1
[InferenceHelperTensorRt][367] dims.d[2] = 1600
[InferenceHelperTensorRt][367] dims.d[3] = 832
[InferenceHelperTensorRt][371] data_type = 0
[InferenceHelperTensorRt][456] 3
[SegmentationEngine][148] 832 1600 3
[SegmentationEngine][167] here[InferenceHelperTensorRt][329] process
[InferenceHelperTensorRt][333] 5324800
cudaMemcpyAsync cost 124.891 [msec]
[InferenceHelperTensorRt][340] process_2
[InferenceHelperTensorRt][345] process_3
[InferenceHelperTensorRt][351] process_4
[SegmentationEngine][173] thereTotal: 1024.149 [msec]
Capture: 12.140 [msec]
Image processing: 833.516 [msec]
Pre processing: 10.712 [msec]
Inference: 682.913 [msec]
Post processing: 39.013 [msec]
=== Finished 0 frame ===
[SegmentationEngine][148] 832 1600 3
[SegmentationEngine][167] here[InferenceHelperTensorRt][329] process
[InferenceHelperTensorRt][333] 5324800
cudaMemcpyAsync cost 324.339 [msec]
[InferenceHelperTensorRt][340] process_2
[InferenceHelperTensorRt][345] process_3
[InferenceHelperTensorRt][351] process_4
[SegmentationEngine][173] thereTotal: 495.360 [msec]
Capture: 11.482 [msec]
Image processing: 479.911 [msec]
Pre processing: 24.809 [msec]
Inference: 341.939 [msec]
Post processing: 33.544 [msec]
=== Finished 1 frame ===
```
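Note that the first frame is roughly twice as slow as the second (682.913 ms vs 341.939 ms inference), which suggests one-time initialization (context setup, cuDNN autotuning, etc.) is being counted in the first measurement. Our numbers above are per-frame; a fairer steady-state measurement would discard warm-up iterations first. A minimal generic harness for that (plain Python; `run_inference` would be a stand-in for the real inference call, not a function from our project) might look like:

```python
import time

def benchmark(fn, warmup=5, iters=20):
    """Time fn(), discarding the first `warmup` calls so one-time
    initialization costs are not mixed into the reported average."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Average wall-clock time per call, in milliseconds
    return (time.perf_counter() - start) / iters * 1000.0
```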