The new TensorRT version introduces a “timing cache” option, enabled by default.
First of all, this should be an opt-in feature rather than an opt-out one, since it is purely an optimization. I spent a huge amount of time debugging my issues because this feature was silently activated by default.
Second, the real problem is that the feature appears to be buggy. When it is enabled, our networks suffer a drop in numerical accuracy (I can’t quantify it precisely right now, but it’s large enough to break our application).
Caching is a notoriously hard CS problem, and it’s easy to miss a dependency when deciding that “this layer is identical to that one, so I can reuse its timing results”. Most likely that is what is happening here: TensorRT is reusing timing results it shouldn’t, leading to an incorrect tactic/kernel selection.
I understand that you’d need a reproducible example to look at this; unfortunately I can’t provide one at the moment. I just want to report this issue early in case you get the chance to look into it on your side.
This happens on TensorRT 8.2.0 + CUDA 11.4.
Thank you for reporting this observation. We will try to investigate it further on our end.
Could you please at least give us some repro steps or commands that you used (along with a reference to the DL algorithm/model type) so that we can better help?
Thanks for the reply!
Checking the release notes I suppose this was introduced in TRT 7 - I haven’t tested this release since I’ve moved from TRT 6.5 to TRT 8.2 directly.
The model in question is a custom feedforward CNN (ResNet-like) built in INT8 mode. The problem also persists in FP32-only mode. My observations are:
If the timing cache is enabled (not serialized to disk), performance drops, and it changes significantly from build to build. By “performance” I mean e.g. the number of objects detected in an image: where I normally get 10 objects, with this setting on I get 30; on the next build it might be 20 or 40. In other words, every build produces significantly different results, so the build process is not really reliable.
If the timing cache is disabled, my network produces the same output (e.g. 10 objects) as with TRT 6.5. It can vary by ±1 object from build to build, but that’s expected given the non-deterministic nature of the build process.
We have solved our problem by setting the kDISABLE_TIMING_CACHE builder flag. But as said, these kinds of features should always be OFF by default. Right now this is inconsistent with the rest of the TensorRT API, where the user is always expected to enable things, not disable them.
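For anyone hitting the same issue, here is a minimal sketch of the workaround using the TensorRT 8.x C++ API. The surrounding builder/network/logger setup is elided; `builder` is assumed to be an already-created `nvinfer1::IBuilder*`:

```cpp
#include "NvInfer.h"

// Assumption: `builder` is a valid nvinfer1::IBuilder* created elsewhere.
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();

// Opt out of the timing cache so every layer is re-timed during this build.
// With this flag set, our builds went back to producing stable,
// TRT 6.x-like outputs as described above.
config->setFlag(nvinfer1::BuilderFlag::kDISABLE_TIMING_CACHE);

// ... proceed with buildSerializedNetwork() / buildEngineWithConfig() as usual.
```

Note that `setFlag` only enables flags; per the current API design there is no way to express “timing cache off” except through this dedicated disable flag, which is exactly the opt-out inconsistency mentioned above.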