TensorRT: slowdown for buildSerializedNetwork()

Description

We have been using TensorRT for a couple of years and recently updated from 8.0.3 to 8.5.1. The update went smoothly and our functional tests produce identical results, but we have noticed that some functions are slower. One in particular is 2x to 4x slower in TensorRT 8.5: buildSerializedNetwork()

This is quite annoying for our functional tests, since we run many different models, some of which are quite large: the worst slowdown we measured is 120 s → 450 s.

This behaviour was seen across platforms (desktop Linux, Jetson Linux, Windows) and across multiple GPU architectures (1080, 2080, 3080, Xavier).

NVIDIA: can you explain this slowdown?

Environment

TensorRT Version: 8.5.1.7
GPU Type: 1080, 2070, 2080, 3080, Xavier AGX (TensorRT 8.4)
Nvidia Driver Version: nvidia-driver-525
CUDA Version: 11.8.0
CUDNN Version: 8.6.0.163
Operating System + Version: Ubuntu 20.04
Baremetal or Container (if container which image + tag): Ubuntu 20.04 baremetal

Relevant Files

Code Sample:
trt_sample.cpp (2.1 KB)
CMakeLists.txt (707 Bytes)
FindTensorRT.cmake (3.2 KB)

This slowdown was consistent across all of our own models.
I was able to reproduce it with public models from the onnx/models GitHub repo, such as:

onnx/models/vision/classification/caffenet/model/caffenet-12.onnx
onnx/models/vision/classification/vgg/model/vgg19-bn-7.onnx
onnx/models/vision/classification/zfnet-512/model/zfnet512-12.onnx
onnx/models/vision/object_detection_segmentation/duc/model/ResNet101-DUC-12.onnx

Steps To Reproduce

On my current setup (Intel + RTX 2070), I am running TensorRT 8.5.1 on bare metal, and a Docker container to run the old TensorRT 8.0.3 (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04).

All results are reproducible on bare metal or in NVIDIA containers.

mkdir build && cd build && cmake ..
make
time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx
time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx
time ./trt_sample ../../onnx_models/vision/classification/zfnet-512/model/zfnet512-12.onnx
time ./trt_sample ../../onnx_models/vision/object_detection_segmentation/duc/model/ResNet101-DUC-12.onnx
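
For reference, the core of trt_sample.cpp boils down to something like the sketch below (a minimal reconstruction, not the attached file verbatim; error handling is trimmed). Running it under time adds ONNX parsing and process startup on top of the build time itself.

#include <chrono>
#include <iostream>
#include <memory>

#include <NvInfer.h>
#include <NvOnnxParser.h>

namespace { // minimal logger required by the TensorRT API
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};
} // namespace

int main(int argc, char** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: trt_sample <model.onnx>" << std::endl;
        return 1;
    }

    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    const auto flags = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flags));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

    if (!parser->parseFromFile(argv[1], static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
    {
        std::cerr << "failed to parse " << argv[1] << std::endl;
        return 1;
    }

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    // Only the engine build is timed, as in the measurements below.
    const auto start = std::chrono::steady_clock::now();
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));
    const auto stop = std::chrono::steady_clock::now();

    std::cout << "buildSerializedNetwork(): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms" << std::endl;
    return serialized ? 0 : 1;
}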

Results

Note that I did not cherry-pick these models; they are the first four I tried. Here are the results for two of them.

TensorRT 8.5.1

time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx

real	0m18.036s
user	0m12.277s
sys		0m3.044s

time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx

real	0m34.488s
user	0m22.877s
sys		0m6.568s

TensorRT 8.0.3

time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx

real	0m8.858s
user	0m5.787s
sys		0m1.881s

time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx

real	0m19.729s
user	0m13.557s
sys		0m3.674s

More results

Initially, I added timing printfs in our codebase to find out which function was slower: here only the function's own time is measured (with chrono::steady_clock::now()).

Here’s the cleaned output of functional tests:
ctest | grep buildSerializedNetwork

Notice the 2x to 4x slowdown throughout all the calls (use diff/meld to compare):
trt8517.txt (3.6 KB)
trt8034.txt (3.6 KB)

Hi,

We request that you share the model, script, profiler, and performance output, if not already shared, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
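
For example (the model path is just an illustration; trtexec ships with the TensorRT package):

trtexec --onnx=../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx --saveEngine=caffenet-12.engine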

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:

Thanks!

Hi,

I’ve already made sure everything is included in my original post.

Thanks,

Hi NVES,

The original post contains a very concise code sample that shows how to reproduce the problem; have you looked at it?


Hi @fl932471,
Between 8.0 and 8.5 many more kernels/backends have been introduced, which increases the auto-tuning cost. Can you please try with the PreviewFeature kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 enabled?
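
Assuming config is your nvinfer1::IBuilderConfig (as in the sample above), enabling it looks roughly like this:

// TensorRT 8.5+: restrict core tactics to avoid the extra external (cuDNN/cuBLAS) auto-tuning cost
config->setPreviewFeature(nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, true);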

Thanks

Also, if you can, please try TRT 8.6, as build performance has been improved there.

Thanks

Thanks @AakankshaS for the follow-up!

Unfortunately, setting this saves anywhere between <1% and 10%, far from the ~2x we experience: for example, vgg19-bn-7.onnx goes from 34 s to 32 s, while TRT 8.0.3 builds it in 19 s. Furthermore, we are currently using cuDNN, so this option is not usable for us anyway.

We also noticed the new TRT 8.6 parameter for specifying the builder optimization level, but adopting it requires a thorough investigation on our side, since it can change the network's runtime performance. It's on our to-do list.
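
For reference, when we do evaluate it, the knob appears to be a single call on the builder config (sketch only, assuming TRT 8.6 and the config object from the sample above; 3 is the documented default, lower values build faster):

// TensorRT 8.6+ only: trade some engine tuning quality for a faster build
config->setBuilderOptimizationLevel(2);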

For now we are relying on a caching mechanism to avoid rebuilding the network.
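
Roughly, the cache works like the sketch below (the helper and file names are illustrative, not our production code; it reuses the TensorRT objects and includes from the sample above). The cached bytes are then handed to IRuntime::deserializeCudaEngine() at startup instead of rebuilding.

#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Illustrative helper: persist the serialized engine so buildSerializedNetwork()
// is only paid on a cache miss.
std::vector<char> loadOrBuildEngine(const std::string& cachePath,
                                    nvinfer1::IBuilder& builder,
                                    nvinfer1::INetworkDefinition& network,
                                    nvinfer1::IBuilderConfig& config)
{
    std::ifstream in(cachePath, std::ios::binary);
    if (in) // cache hit: skip the expensive build entirely
        return std::vector<char>(std::istreambuf_iterator<char>(in), {});

    // cache miss: build once, then write the serialized blob to disk
    std::unique_ptr<nvinfer1::IHostMemory> blob(builder.buildSerializedNetwork(network, config));
    std::vector<char> bytes(static_cast<char*>(blob->data()),
                            static_cast<char*>(blob->data()) + blob->size());
    std::ofstream(cachePath, std::ios::binary).write(bytes.data(), bytes.size());
    return bytes;
}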