P6000: TensorRT too slow and the serialized FP16 engine size is not as expected

Description

I am trying to run a segmentation ONNX model with TensorRT.
Input shape: 1x3x1600x832; output shape: 1x1x1600x832.
First, I tested on Ubuntu with an A100-40G, and the results seem good. The inference time is around 40~50 ms, and when I set the FP16 flag, the FP16 TRT model is accelerated to around 25 ms. The serialized FP16 engine is nearly half the size of the FP32 one.

But when I moved the same project to Windows 10 with a P6000, some strange things occurred.
The inference time becomes ~400 ms for the float model.
And when I set the FP16 flag, the model shows no speedup (still ~400 ms), and the file size of the serialized TRT engine is still close to the FP32 one.
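
For reference, this is roughly how the FP16 flag is set during the engine build. A minimal sketch of the standard ONNX-parser build path, not the exact code of this project; `model.onnx`, `use_fp16`, and the logger are placeholders:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdint>
#include <iostream>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
    }
} logger;

bool buildEngine(bool use_fp16) {
    auto builder = nvinfer1::createInferBuilder(logger);
    const auto flags = 1U << static_cast<uint32_t>(
        nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = builder->createNetworkV2(flags);
    auto parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile("model.onnx",
        static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = builder->createBuilderConfig();
    if (use_fp16 && builder->platformHasFastFp16()) {
        // kFP16 only *allows* half-precision kernels; TensorRT still keeps
        // a layer in FP32 if the FP32 kernel is faster on the target GPU.
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
    }
    auto serialized = builder->buildSerializedNetwork(*network, *config);
    // ... write serialized->data(), serialized->size() to disk ...
    return serialized != nullptr;
}
```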

Environment

TensorRT Version: 8.4.1
GPU Type: A100 (Ubuntu) / P6000 (Windows 10)
Nvidia Driver Version: 522.25
CUDA Version: 11.6
CUDNN Version: 8.2.1 / 8.4.1 / 8.5.1 (we tried all three versions; the results were the same)
Operating System + Version: Ubuntu20.04 / Windows 10
Python Version (if applicable): 3.8
TensorFlow Version (if applicable): -
PyTorch Version (if applicable): 1.12
Baremetal or Container (if container which image + tag): -

Relevant Files

Steps To Reproduce

Some logs on Windows 10:
[InferenceHelper][117] Use TensorRT
[04/04/2023-18:26:31] [I] [TRT] [MemUsageChange] Init CUDA: CPU +293, GPU +0, now: CPU 19005, GPU 1156 (MiB)
[04/04/2023-18:26:38] [I] [TRT] Loaded engine size: 233 MiB
[04/04/2023-18:26:38] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +321, GPU +112, now: CPU 19647, GPU 1502 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +269, GPU +82, now: CPU 19916, GPU 1584 (MiB)
[04/04/2023-18:26:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.1
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +254, now: CPU 0, GPU 254 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 19678, GPU 1598 (MiB)
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 19678, GPU 1606 (MiB)
[04/04/2023-18:26:39] [W] [TRT] TensorRT was linked against cuDNN 8.4.1 but loaded cuDNN 8.2.1
[04/04/2023-18:26:39] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +743, now: CPU 0, GPU 997 (MiB)
[InferenceHelperTensorRt][359] num_of_in_out = 2
[InferenceHelperTensorRt][362] tensor[0]->name: input
[InferenceHelperTensorRt][363] is input = 1
[InferenceHelperTensorRt][367] dims.d[0] = 1
[InferenceHelperTensorRt][367] dims.d[1] = 3
[InferenceHelperTensorRt][367] dims.d[2] = 1600
[InferenceHelperTensorRt][367] dims.d[3] = 832
[InferenceHelperTensorRt][371] data_type = 0
[InferenceHelperTensorRt][362] tensor[1]->name: output
[InferenceHelperTensorRt][363] is input = 0
[InferenceHelperTensorRt][367] dims.d[0] = 1
[InferenceHelperTensorRt][367] dims.d[1] = 1
[InferenceHelperTensorRt][367] dims.d[2] = 1600
[InferenceHelperTensorRt][367] dims.d[3] = 832
[InferenceHelperTensorRt][371] data_type = 0
[InferenceHelperTensorRt][456] 3
[SegmentationEngine][148] 832 1600 3
[SegmentationEngine][167] here[InferenceHelperTensorRt][329] process
[InferenceHelperTensorRt][333] 5324800
cudaMemcpyAsync cost 124.891 [msec]
[InferenceHelperTensorRt][340] process_2
[InferenceHelperTensorRt][345] process_3
[InferenceHelperTensorRt][351] process_4
[SegmentationEngine][173] thereTotal: 1024.149 [msec]
Capture: 12.140 [msec]
Image processing: 833.516 [msec]
Pre processing: 10.712 [msec]
Inference: 682.913 [msec]
Post processing: 39.013 [msec]
=== Finished 0 frame ===

[SegmentationEngine][148] 832 1600 3
[SegmentationEngine][167] here[InferenceHelperTensorRt][329] process
[InferenceHelperTensorRt][333] 5324800
cudaMemcpyAsync cost 324.339 [msec]
[InferenceHelperTensorRt][340] process_2
[InferenceHelperTensorRt][345] process_3
[InferenceHelperTensorRt][351] process_4
[SegmentationEngine][173] thereTotal: 495.360 [msec]
Capture: 11.482 [msec]
Image processing: 479.911 [msec]
Pre processing: 24.809 [msec]
Inference: 341.939 [msec]
Post processing: 33.544 [msec]
=== Finished 1 frame ===

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

Hi,

Please share the model, script, profiler, and performance output (if not already shared) so that we can help you better.

Alternatively, you can try running your model with the trtexec command, as shown below.
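
For example, something like the following builds and profiles an FP16 engine (model path and engine name are placeholders; --dumpProfile prints per-layer timings):

```
trtexec --onnx=model.onnx --fp16 --saveEngine=model_fp16.plan --verbose --dumpProfile
```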

While measuring model performance, make sure you consider the latency and throughput of the network inference only, excluding the data pre- and post-processing overhead.
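
For instance, on the C++ side the inference-only latency can be isolated with CUDA events around the enqueue call. A sketch, assuming `context`, `bindings`, and `stream` come from your existing setup:

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Measure only the network execution with CUDA events, so the large
// cudaMemcpyAsync pre/post copies seen in the log are excluded.
float timedInferenceMs(nvinfer1::IExecutionContext* context,
                       void* const* bindings, cudaStream_t stream) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    context->enqueueV2(bindings, stream, nullptr);  // inference only
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```
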
Please refer to the below links for more details:

Thanks!