Deserializing cuda-engine@JetPack4.4, performance issue

I am porting code from older JetPack versions to JetPack 4.4. Previously we had no problem serializing and deserializing engines to save startup time at deployment (which is a requirement for us).
This works with JetPack 4.4 too, but performance is much worse than when the engine is created at start-up. A run typically takes ~20 ms, but with the deserialized engine it takes ~600 ms. I put the thread to sleep for 1 s after deserialization to make sure the slowdown was not caused by deserialization still being in progress when the kernel is executed, but that made no difference.
Any ideas? Is there some setting that I missed?

Note: It settles to the normal speed after executing several times. It is still strange that it did not exhibit this behavior before, and that a waiting period during start-up did not remove it either. Perhaps it is the program being transferred on the first execution?
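
To put numbers on the first-run versus settled behavior described above, here is a minimal timing sketch. It is only an illustration: `run_once` is a hypothetical placeholder for whatever performs one complete, synchronous execution of the engine (it must block until the GPU work has finished, otherwise the timings are meaningless), and `time_runs`, `warmup` and `timed` are names made up for this example, not part of any NVIDIA API.

```python
import time
from statistics import median


def time_runs(run_once, warmup=5, timed=20):
    """Return (first_call_ms, steady_state_median_ms) for a blocking callable.

    run_once is a hypothetical stand-in for one full, synchronous execution.
    """
    # Time the very first call on its own: this is where one-time costs show up.
    start = time.perf_counter()
    run_once()
    first_ms = (time.perf_counter() - start) * 1e3

    # Remaining warm-up calls are executed but not recorded.
    for _ in range(max(warmup - 1, 0)):
        run_once()

    # Steady-state samples, summarized by the median to damp outliers.
    samples = []
    for _ in range(timed):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)

    return first_ms, median(samples)
```

If the first-call time is large while the median matches the ~20 ms figure, that would point to a one-time cost on first execution rather than a slower engine, and running a few dummy executions at start-up would be one way to hide it.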