Description
We’ve been using TensorRT for a couple of years now and recently updated from 8.0.3 to 8.5.1. The update went smoothly and our functional tests produce identical results, but we have noticed slower processing for some functions. One in particular, buildSerializedNetwork(), is 2x to 4x slower in TensorRT 8.5.
This is quite annoying for our functional tests, since we run many different models, some of which are quite large: the worst slowdown we measured is 120s → 450s.
This behaviour was seen across platforms (Desktop Linux, Jetson Linux, Windows) and across multiple GPU architectures (GTX 1080, RTX 2080, RTX 3080, Xavier).
NVIDIA: can you explain this slowdown?
Environment
TensorRT Version: 8.5.1.7
GPU Type: GTX 1080, RTX 2070, RTX 2080, RTX 3080, Xavier AGX (TensorRT 8.4)
Nvidia Driver Version: nvidia-driver-525
CUDA Version: 11.8.0
CUDNN Version: 8.6.0.163
Operating System + Version: Ubuntu 20.04
Baremetal or Container (if container which image + tag): Ubuntu 20.04 baremetal
Relevant Files
Code Sample:
trt_sample.cpp (2.1 KB)
CMakeLists.txt (707 Bytes)
FindTensorRT.cmake (3.2 KB)
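For context, the attached trt_sample.cpp boils down to parsing the ONNX model and calling buildSerializedNetwork(). The sketch below shows roughly the same flow (illustrative, not the exact attached code):

#include <iostream>
#include <memory>

#include <NvInfer.h>
#include <NvOnnxParser.h>

// Minimal logger required by the TensorRT builder.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main(int argc, char** argv)
{
    if (argc < 2)
    {
        std::cerr << "usage: trt_sample <model.onnx>" << std::endl;
        return 1;
    }

    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));

    if (!parser->parseFromFile(argv[1], static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
    {
        std::cerr << "failed to parse " << argv[1] << std::endl;
        return 1;
    }

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    // This is the call that takes 2x to 4x longer with TensorRT 8.5.1.
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));

    return serialized ? 0 : 1;
}

The whole binary is timed with time in the steps below, so parsing is included there; the per-call timings in the “More results” section isolate buildSerializedNetwork() itself.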
This slowdown was consistent across all of our own models.
I was able to reproduce it with public models from the onnx/models GitHub repo, such as:
onnx/models/vision/classification/caffenet/model/caffenet-12.onnx
onnx/models/vision/classification/vgg/model/vgg19-bn-7.onnx
onnx/models/vision/classification/zfnet-512/model/zfnet512-12.onnx
onnx/models/vision/object_detection_segmentation/duc/model/ResNet101-DUC-12.onnx
Steps To Reproduce
On my current setup (Intel CPU + RTX 2070), I am running TensorRT 8.5.1 on baremetal, and a Docker container (nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04) to run the old TensorRT 8.0.3.
All results are reproducible both on baremetal and in NVIDIA containers.
mkdir build && cd build && cmake ..
make
time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx
time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx
time ./trt_sample ../../onnx_models/vision/classification/zfnet-512/model/zfnet512-12.onnx
time ./trt_sample ../../onnx_models/vision/object_detection_segmentation/duc/model/ResNet101-DUC-12.onnx
Results
Note that I did not cherry-pick these models; they are simply the first four I tried. Here are the results for two of them.
TensorRT 8.5.1
time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx
real 0m18.036s
user 0m12.277s
sys 0m3.044s
time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx
real 0m34.488s
user 0m22.877s
sys 0m6.568s
TensorRT 8.0.3
time ./trt_sample ../../onnx_models/vision/classification/caffenet/model/caffenet-12.onnx
real 0m8.858s
user 0m5.787s
sys 0m1.881s
time ./trt_sample ../../onnx_models/vision/classification/vgg/model/vgg19-bn-7.onnx
real 0m19.729s
user 0m13.557s
sys 0m3.674s
More results
Initially, I added timing printfs in our codebase to find which function was slower: here only the function time itself is measured (with chrono::steady_clock::now()).
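The measurement is a plain steady_clock delta around each call, along the lines of the sketch below (the timedCall helper is illustrative, not the exact code from our codebase):

#include <chrono>
#include <cstdio>
#include <utility>

// Time a single call and print its label and duration, mirroring the
// printfs added around buildSerializedNetwork() in our codebase.
template <typename F>
auto timedCall(const char* label, F&& f)
{
    auto start = std::chrono::steady_clock::now();
    auto result = std::forward<F>(f)();
    auto end = std::chrono::steady_clock::now();
    std::printf("%s: %lld ms\n", label,
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()));
    return result;
}

// Hypothetical usage:
// auto* serialized = timedCall("buildSerializedNetwork",
//     [&] { return builder->buildSerializedNetwork(*network, *config); });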
Here’s the cleaned output of functional tests:
ctest | grep buildSerializedNetwork
Notice the 2x to 4x slowdown across all the calls (use diff/meld to compare the two files).
trt8517.txt (3.6 KB)
trt8034.txt (3.6 KB)