TensorRT: slowdown for buildSerializedNetwork()

Description

We have been using TensorRT for a couple of years and recently updated from 8.0.3 to 8.5.1. The update went smoothly and our functional tests produce identical results, but we have noticed that some functions are slower. One in particular is 2x to 4x slower in TensorRT 8.5: buildSerializedNetwork()

This is quite annoying for our functional tests, since we run many different models, some of which are quite large: the worst slowdown we measured is 120 s → 450 s.

This behaviour was seen across platforms (desktop Linux, Jetson Linux, Windows) and across multiple GPU architectures (1080, 2080, 3080, Xavier).

NVIDIA: can you explain this slowdown?

Environment

TensorRT Version: 8.5.1.7
GPU Type: 1080, 2070, 2080, 3080, Xavier AGX (TensorRT 8.4)
Nvidia Driver Version: nvidia-driver-525
CUDA Version: 11.8.0
CUDNN Version: 8.6.0.163
Operating System + Version: Ubuntu 20.04
Baremetal or Container (if container which image + tag): Ubuntu 20.04 baremetal

Relevant Files

Code Sample:
trt_sample.cpp (2.1 KB)
CMakeLists.txt (707 Bytes)
FindTensorRT.cmake (3.2 KB)

This slowdown was consistent across all of our own models.
I was able to reproduce it with public models from the onnx/models GitHub repo, such as:

onnx/models/vision/classification/caffenet/model/caffenet-12.onnx
onnx/models/vision/classification/vgg/model/vgg19-bn-7.onnx
onnx/models/vision/classification/zfnet-512/model/zfnet512-12.onnx
onnx/models/vision/object_detection_segmentation/duc/model/ResNet101-DUC-12.onnx

Steps To Reproduce

On my current setup (Intel + RTX 2070), I am running TensorRT 8.5.1 on bare metal, and a Docker container to run the old TensorRT 8.0.3 (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04).

All results are reproducible on bare metal or in NVIDIA containers.

mkdir build && cd build && cmake ..
make
time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx
time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx
time ./trt_sample ../../onnx_models/vision/classification/zfnet-512/model/zfnet512-12.onnx
time ./trt_sample ../../onnx_models/vision/object_detection_segmentation/duc/model/ResNet101-DUC-12.onnx
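
For reference, the core of trt_sample.cpp boils down to something like the sketch below (a minimal reconstruction, not the attached file verbatim; error handling is trimmed). Running it under time adds ONNX parsing and process startup on top of the build time itself.

#include <chrono>
#include <iostream>
#include <memory>

#include <NvInfer.h>
#include <NvOnnxParser.h>

namespace { // minimal logger required by the TensorRT API
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};
} // namespace

int main(int argc, char** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: trt_sample <model.onnx>" << std::endl;
        return 1;
    }

    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    const auto flags = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flags));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

    if (!parser->parseFromFile(argv[1], static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
    {
        std::cerr << "failed to parse " << argv[1] << std::endl;
        return 1;
    }

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    // Only the engine build is timed, as in the measurements below.
    const auto start = std::chrono::steady_clock::now();
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));
    const auto stop = std::chrono::steady_clock::now();

    std::cout << "buildSerializedNetwork(): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
              << " ms" << std::endl;
    return serialized ? 0 : 1;
}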

Results

Note that I did not cherry-pick these models; they are the first four I tried. Here are the results for two of them.

TensorRT 8.5.1

time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx

real	0m18.036s
user	0m12.277s
sys		0m3.044s

time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx

real	0m34.488s
user	0m22.877s
sys		0m6.568s

TensorRT 8.0.3

time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx

real	0m8.858s
user	0m5.787s
sys		0m1.881s

time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx

real	0m19.729s
user	0m13.557s
sys		0m3.674s

More results

Initially, I added timing printfs in our codebase to find out which function was slower: here only the function's own time is measured (with chrono::steady_clock::now()).

Here’s the cleaned output of functional tests:
ctest | grep buildSerializedNetwork

Notice the 2x to 4x slowdown throughout all the calls (use diff/meld to compare):
trt8517.txt (3.6 KB)
trt8034.txt (3.6 KB)

Hi,

We request that you share the model, script, profiler, and performance output, if not already shared, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.
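
For example (the model path is just an illustration; trtexec ships with the TensorRT package):

trtexec --onnx=../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx --saveEngine=caffenet-12.engine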

While measuring the model performance, make sure you consider the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:

Thanks!

Hi,

I’ve already made sure everything is included in my original post.

Thanks,

Hi NVES,

The original post contains a very concise code sample that shows how to reproduce the problem; have you looked at it?


Hi @fl932471,
Between 8.0 and 8.5 many more kernels/backends have been introduced, which increases the auto-tuning cost. Can you please try with the PreviewFeature kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805 enabled?
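
Assuming config is your nvinfer1::IBuilderConfig (as in the sample above), enabling it looks roughly like this:

// TensorRT 8.5+: restrict core tactics to avoid the extra external (cuDNN/cuBLAS) auto-tuning cost
config->setPreviewFeature(nvinfer1::PreviewFeature::kDISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805, true);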

Thanks

Also, if you can, please try TRT 8.6, as build performance has been improved there.

Thanks

Thanks @AakankshaS for the follow-up!

Unfortunately, setting this saves anywhere between <1% and 10%, far from the ~2x we experience: for example, vgg19-bn-7.onnx goes from 34 s to 32 s, while TRT 8.0.3 builds it in 19 s. Furthermore, we are currently using cuDNN, so this option is not usable for us anyway.

We also noticed the new TRT 8.6 parameter for specifying the builder optimization level, but adopting it requires a thorough investigation on our side, since it can change the network's runtime performance. It's on our to-do list.
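
For reference, when we do evaluate it, the knob appears to be a single call on the builder config (sketch only, assuming TRT 8.6 and the config object from the sample above; 3 is the documented default, lower values build faster):

// TensorRT 8.6+ only: trade some engine tuning quality for a faster build
config->setBuilderOptimizationLevel(2);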

For now we are relying on a caching mechanism to avoid rebuilding the network.
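
Roughly, the cache works like the sketch below (the helper and file names are illustrative, not our production code; it reuses the TensorRT objects and includes from the sample above). The cached bytes are then handed to IRuntime::deserializeCudaEngine() at startup instead of rebuilding.

#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Illustrative helper: persist the serialized engine so buildSerializedNetwork()
// is only paid on a cache miss.
std::vector<char> loadOrBuildEngine(const std::string& cachePath,
                                    nvinfer1::IBuilder& builder,
                                    nvinfer1::INetworkDefinition& network,
                                    nvinfer1::IBuilderConfig& config)
{
    std::ifstream in(cachePath, std::ios::binary);
    if (in) // cache hit: skip the expensive build entirely
        return std::vector<char>(std::istreambuf_iterator<char>(in), {});

    // cache miss: build once, then write the serialized blob to disk
    std::unique_ptr<nvinfer1::IHostMemory> blob(builder.buildSerializedNetwork(network, config));
    std::vector<char> bytes(static_cast<char*>(blob->data()),
                            static_cast<char*>(blob->data()) + blob->size());
    std::ofstream(cachePath, std::ios::binary).write(bytes.data(), bytes.size());
    return bytes;
}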