When running the sample /usr/src/tensorrt/samples/python/uff_ssd/, inference is much faster when the engine is built and then used for inference in the same run (23 ms). However, on subsequent runs, when the engine that was built, serialized, and saved on the first run is loaded and deserialized, inference is much slower (1100 ms).
The bounding box outputs are exactly the same.
I would like to have a saved engine that can be loaded rather than having to rebuild it every time. Is there a reason that building on the fly would give faster inference than loading an engine file?
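For reference, the save/load flow in question looks roughly like this; a minimal sketch against the TensorRT Python API, where the file name ssd.engine and the helper names are illustrative:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def save_engine(engine, path="ssd.engine"):
    # Serialize the built engine to disk so later runs can skip the build.
    with open(path, "wb") as f:
        f.write(engine.serialize())


def load_engine(path="ssd.engine"):
    # Deserialize a previously saved engine instead of rebuilding it.
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())
```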
You can refer to the link below for the full list of supported operators; if any operator is not supported, you need to create a custom plugin to support that operation.
Also, please share your model and script, if you have not already, so that we can help you better.
I understand this Python timing code is not as accurate as the timing done by trtexec, but the difference is large enough that I suspect something is up.
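As an aside, Python-side timing gets closer to trtexec's numbers if the CUDA stream is synchronized inside the timed region; a rough sketch, assuming the context, bindings, and stream (a pycuda.driver.Stream) come from the sample's existing inference setup:

```python
import time


def time_inference(context, bindings, stream, n_runs=10):
    # Synchronize the stream inside the timed region so the measurement
    # covers GPU execution, not just the kernel launch.
    start = time.perf_counter()
    for _ in range(n_runs):
        context.execute_async(bindings=bindings, stream_handle=stream.handle)
        stream.synchronize()
    return (time.perf_counter() - start) / n_runs * 1e3  # avg ms per run
```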
The issue is probably library startup time (mostly loading SASS kernels onto the GPU).
If you're running the builder first and only then starting the timer, that startup time is never observed in your measurement.
Is the application a one-shot inference? If not, then you need to time the second invocation.
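In that case, a warm-up invocation before the timed ones should make the two paths comparable; a rough sketch, where do_inference stands in for the sample's inference call:

```python
import time


def benchmark(do_inference, n_warmup=1, n_runs=10):
    # Warm-up runs absorb one-time startup costs (library
    # initialization, loading SASS kernels onto the GPU).
    for _ in range(n_warmup):
        do_inference()
    # Time only the steady-state invocations.
    start = time.perf_counter()
    for _ in range(n_runs):
        do_inference()
    return (time.perf_counter() - start) / n_runs * 1e3  # avg ms per run
```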