We have upgraded from 8.0.1 to 8.5.3 and have noticed that it takes significantly longer to initialize, parse the ONNX model, and build and serialize the engine. Inference is marginally faster, which is nice, but the slower initialization will cause issues for our tests and users.
Is this expected behavior of this version, or a bug? How can I fix this?
Evidence and steps to reproduce
Our benchmarks:
Initialization with timing cache:
8.0 = 4779 ms
8.5 = 7861 ms
Initialization without timing cache:
8.0 = 27326 ms
8.5 = 80748 ms
We are using FP16, but I think the difference can be observed with any optimization profile.
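For reference, this is roughly the build path we are timing. It is a minimal sketch using the TensorRT Python API, not our production code, and the file paths are placeholders:

```python
import time

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

t0 = time.perf_counter()
with open("model.onnx", "rb") as f:              # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))
t_parse = time.perf_counter()

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Reuse a timing cache from a previous build if one exists.
try:
    with open("timing.cache", "rb") as f:        # placeholder path
        cache = config.create_timing_cache(f.read())
except FileNotFoundError:
    cache = config.create_timing_cache(b"")
config.set_timing_cache(cache, ignore_mismatch=False)

engine_bytes = builder.build_serialized_network(network, config)
t_build = time.perf_counter()

# Persist the updated timing cache for the next build.
with open("timing.cache", "wb") as f:
    f.write(cache.serialize())

print(f"parse: {t_parse - t0:.1f} s, build+serialize: {t_build - t_parse:.1f} s")
```

Reusing the timing cache lets the builder skip most of the kernel timing work, which is why the "with timing cache" numbers above are so much lower than the cold-build numbers.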
MNIST sample
I was able to reproduce the increase with the TensorRT samples:
8.0 = 3.113 s mean
docker run --gpus all --rm nvcr.io/nvidia/tensorrt:21.08-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb && apt-get update && apt-get install sudo && cd /usr/src/tensorrt/samples/sampleOnnxMNIST/ && make && hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/sample_onnx_mnist --fp16'"
8.5 = 10.134 s mean
docker run --gpus all --rm nvcr.io/nvidia/tensorrt:23.03-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb && apt-get update && apt-get install sudo && cd /usr/src/tensorrt/samples/sampleOnnxMNIST/ && make && hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/sample_onnx_mnist --fp16'"
MNIST trtexec
Interestingly, I am not able to reproduce this with trtexec. I can't see what trtexec is doing differently from the samples, but trtexec takes excessively long in both versions, which I think masks the issue:
8.0 = 12.128 s mean
docker run --gpus all --rm nvcr.io/nvidia/tensorrt:21.08-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb && hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --explicitBatch --workspace=1024 --fp16'"
8.5 = 12.272 s mean
docker run --gpus all --rm nvcr.io/nvidia/tensorrt:23.03-py3 sh -c "wget https://github.com/sharkdp/hyperfine/releases/download/v1.16.1/hyperfine_1.16.1_amd64.deb && apt install ./hyperfine_1.16.1_amd64.deb && hyperfine --runs 5 --show-output '/usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --explicitBatch --workspace=1024 --fp16'"
Could you please share the model, script, profiler, and performance output (if not shared already) so that we can help you better?
Alternatively, you can try running your model with the trtexec command.
When measuring model performance, make sure you consider the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
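As an illustration (not your code), inference latency can be measured roughly like this with the TensorRT Python API and pycuda, assuming a static-shape engine; the engine path and run counts are placeholders:

```python
import time

import numpy as np
import pycuda.autoinit            # creates a CUDA context for this process
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:                     # placeholder path
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One device buffer per binding (static shapes assumed).
bindings = []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    nbytes = trt.volume(engine.get_binding_shape(i)) * np.dtype(dtype).itemsize
    bindings.append(int(cuda.mem_alloc(nbytes)))

# Warm up, then time only the synchronous inference calls themselves,
# excluding pre-/post-processing and host/device copies.
for _ in range(10):
    context.execute_v2(bindings)
runs = 100
start = time.perf_counter()
for _ in range(runs):
    context.execute_v2(bindings)
print(f"mean latency: {(time.perf_counter() - start) / runs * 1e3:.2f} ms")
```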
Please refer to the below links for more details:
I can confirm I also see a similar issue. Running the MNIST examples above, I get a similar time difference; the newer version is slower:
[08/07/2023-15:23:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 2522, GPU 3153 (MiB)
&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8001] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
Time (mean ± σ): 3.523 s ± 0.140 s [User: 1.690 s, System: 0.882 s]
Range (min … max): 3.430 s … 3.770 s 5 runs
vs
[08/07/2023-15:26:45] [I]
&&&& PASSED TensorRT.sample_onnx_mnist [TensorRT v8503] # /usr/src/tensorrt/bin/sample_onnx_mnist --fp16
Time (mean ± σ): 9.991 s ± 0.227 s [User: 5.434 s, System: 1.863 s]
Range (min … max): 9.789 s … 10.313 s 5 runs
In particular, the logs appear to indicate that the new slowness occurs between these log lines:
[08/07/2023-15:26:38] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/07/2023-15:26:45] [I] [TRT] Total Activation Memory: 8337798656
The gap after "Detected invalid timing cache, setup a local cache instead" (which looks like the equivalent log message to me) in the first example is only 2 seconds.
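One way to narrow down which builder phase accounts for that gap (I have not run this against these exact containers) is to wrap the TensorRT logger so every message is prefixed with elapsed time; the class name here is my own:

```python
import time

import tensorrt as trt


class TimestampLogger(trt.ILogger):
    """Prefix every TensorRT log message with seconds since construction."""

    def __init__(self):
        trt.ILogger.__init__(self)
        self.start = time.perf_counter()

    def log(self, severity, msg):
        print(f"[{time.perf_counter() - self.start:8.3f}s] [{severity}] {msg}")


# Use it in place of trt.Logger when building, e.g.:
logger = TimestampLogger()
builder = trt.Builder(logger)
```

Diffing those timestamps between the 8.0 and 8.5 containers should show exactly which messages the extra seconds sit between.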
We recommend that you try the latest TensorRT version, 8.6.1.
If you still observe the same issue, please share the complete verbose logs with us.