ONNX engine initialisation/build takes significantly longer in TensorRT 8.5 vs 8.0

Description

We have upgraded from 8.0.1 to 8.5.3 and have noticed that it takes significantly longer to initialise, parse the ONNX model, and build and serialize the engine. Inference is marginally faster, which is nice, but the slower initialisation will cause issues for our tests and users.

Is this expected behavior of this version, or a bug?
How can I fix this?

Evidence and steps to reproduce

Our benchmarks:

Initialization with timing cache:

  • 8.0 = 4779 ms
  • 8.5 = 7861 ms

Initialization without timing cache:

  • 8.0 = 27326 ms
  • 8.5 = 80748 ms

We are using fp16, but I think a difference can be observed with any optimization profile.
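
For context, the numbers above are measured around a build flow roughly like the sketch below. This is only an illustration using the TensorRT Python API; the model path, cache path and fp16 flag are placeholders rather than our actual setup.

    import os
    import tensorrt as trt

    ONNX_PATH = "model.onnx"      # placeholder model path
    CACHE_PATH = "timing.cache"   # placeholder timing-cache path

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX model.
    with open(ONNX_PATH, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("failed to parse the ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)

    # Reuse a previously serialized timing cache if one exists
    # ("with timing cache"), otherwise start empty ("without timing cache").
    cache_data = b""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            cache_data = f.read()
    timing_cache = config.create_timing_cache(cache_data)
    config.set_timing_cache(timing_cache, False)

    # Build and serialize the engine; this is the step whose duration regressed.
    serialized_engine = builder.build_serialized_network(network, config)

    # Persist the (possibly updated) timing cache for the next build.
    with open(CACHE_PATH, "wb") as f:
        f.write(memoryview(timing_cache.serialize()))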

MNIST sample

I was able to reproduce the increase with the TensorRT samples:

  • 8.0 = 3.113 s mean
    docker run --gpus all --rm nvcr.io/nvidia/tensorrt:21.08-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb && apt-get update && apt-get install sudo && cd /usr/src/tensorrt/samples/sampleOnnxMNIST/ && make && hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/sample_onnx_mnist --fp16'"
    
  • 8.5 = 10.134 s mean
     docker run --gpus all --rm nvcr.io/nvidia/tensorrt:23.03-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb && apt-get update && apt-get install sudo && cd /usr/src/tensorrt/samples/sampleOnnxMNIST/ && make && hyperfine --runs 5  --show-output '/usr/src/tensorrt/bin/sample_onnx_mnist --fp16'"
    

MNIST trtexec

Interestingly, I am not able to reproduce this with trtexec. I can't see what trtexec is doing differently from the samples, but trtexec takes excessively long in both versions, which I think masks the issue:

  • 8.0 = 12.128 s mean
    docker run --gpus all --rm nvcr.io/nvidia/tensorrt:21.08-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb &&  hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --explicitBatch --workspace=1024 --fp16'"
    
  • 8.5 = 12.272 s mean
    docker run --gpus all --rm nvcr.io/nvidia/tensorrt:23.03-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb &&  hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --explicitBatch --workspace=1024 --fp16'"
    

Environment

TensorRT Version: 8.5.3
GPU Type: NVIDIA GeForce RTX 3070 Laptop GPU
Nvidia Driver Version: 525.125.06
CUDA Version: 11.4
Operating System + Version: Ubuntu 20.04

Hi,

We request that you share the model, script, profiler, and performance output, if not shared already, so that we can help you better.

Alternatively, you can try running your model with the trtexec command.

When measuring model performance, make sure you consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
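
For example, something along the lines of the following sketch times only the inference call itself (Python API shown for illustration; the engine and bindings objects are assumed to already exist and hold preprocessed inputs):

    import time

    # Assumption: `engine` is an already deserialized trt.ICudaEngine and
    # `bindings` is a list of device buffer addresses holding preprocessed input.
    context = engine.create_execution_context()

    context.execute_v2(bindings)          # warm-up run
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        context.execute_v2(bindings)      # inference only, no pre/post-processing
    mean_ms = (time.perf_counter() - start) / runs * 1e3
    print(f"mean inference latency: {mean_ms:.3f} ms")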

Thanks!

Can confirm I also see a similar issue. Running the MNIST examples above, I get a similar time difference: the newer version is slower.

[08/07/2023-15:23:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2522, GPU 3153 (MiB)
&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8001] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
  Time (mean ± σ):      3.523 s ±  0.140 s    [User: 1.690 s, System: 0.882 s]
  Range (min … max):    3.430 s …  3.770 s    5 runs

vs

[08/07/2023-15:26:45] [I] 
&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8503] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
  Time (mean ± σ):      9.991 s ±  0.227 s    [User: 5.434 s, System: 1.863 s]
  Range (min … max):    9.789 s … 10.313 s    5 runs

In particular, the logs appear to indicate that the new, slower code path lies between these log lines:

[08/07/2023-15:26:38] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/07/2023-15:26:45] [I] [TRT] Total Activation Memory: 8337798656

By comparison, the gap after "Detected invalid timing cache, setup a local cache instead" (which looks like the equivalent log message to me) in the first example is only 2 seconds.
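
To measure that gap more precisely, one option is to timestamp every builder log message. A minimal sketch, assuming the TensorRT Python API (the C++ sample would need an equivalent custom ILogger):

    import time
    import tensorrt as trt

    class TimestampedLogger(trt.ILogger):
        """Prefix each TensorRT message with elapsed wall-clock time so the
        gap between builder phases is visible directly in the log."""

        def __init__(self):
            trt.ILogger.__init__(self)
            self.start = time.perf_counter()

        def log(self, severity, msg):
            if severity != trt.ILogger.Severity.VERBOSE:
                print(f"[{time.perf_counter() - self.start:8.3f}s] {msg}")

    logger = TimestampedLogger()
    builder = trt.Builder(logger)
    # ... parse the ONNX model and build as usual; the timestamps show how long
    # the builder spends between "Local timing cache in use" and
    # "Total Activation Memory".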


Hi,

We recommend that you try the latest TensorRT version, 8.6.1.
If you still observe the same issue, please share with us the complete verbose logs.

Thank you.

8.6.1 is slightly faster, but it is still much slower than 8.0.

Is there an explanation or a mitigation for this regression?

MNIST sample
8.0:

&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8001] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
  Time (mean ± σ):      3.026 s ±  0.027 s    [User: 1.528 s, System: 0.710 s]
  Range (min … max):    2.982 s …  3.052 s    5 runs

8.5:

&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8503] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
  Time (mean ± σ):     10.042 s ±  0.208 s    [User: 5.695 s, System: 1.759 s]
  Range (min … max):    9.874 s … 10.364 s    5 runs

8.6.1:

&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8601] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
  Time (mean ± σ):      9.263 s ±  0.116 s    [User: 6.225 s, System: 1.200 s]
  Range (min … max):    9.100 s …  9.404 s    5 runs
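
Coming back to the mitigation question: one knob that may be worth trying on 8.6 is the builder optimization level, which trades tactic-search time (and hence build time) against potential runtime performance. A minimal sketch, assuming the Python API and that a lower level is acceptable for your use case:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)

    # TensorRT 8.6: lower optimization levels spend less time searching tactics,
    # which shortens engine builds at the cost of potentially slower inference
    # (the default level is 3).
    config.builder_optimization_level = 2

    # ... create the network, parse the ONNX model and call
    # builder.build_serialized_network(network, config) as usual.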

Bump, are there any updates on this?

Hi,

We were able to reproduce this issue. It will be fixed in a future release.

Thank you.


Great, thanks for the update!

Please let us know when the fix makes it into a release, and then I can mark this thread as resolved.