First inference through the engine takes 600% more time than subsequent frames

I went through the existing issues and I understand that the first inference will take longer.

In my case, I am packaging my application in a Docker image that has the JetPack components installed. The image is built through a CI pipeline, and on the first launch of the application, which loads the engine file and runs inference, the first frame takes 600% more time than the second, third, and subsequent frames. While debugging, I found that this extra time is spent loading the engine file. I found a hack: save the Docker image in a pre-loaded state. I do these steps after the image is built on CI:

  1. Start a container
  2. Run the application
  3. Open a new terminal and run “docker commit”
  4. Exit the running container.

[Edit]: this approach does not work if you push the image to a container registry → remove all local Docker images → pull the same image back.

With the committed image, if I run the application in a new container it skips the loading part and hence works as expected.

I was wondering if there is a way to handle this directly in the CI pipeline. My application uses a camera, so running the application is not an option.

I would be glad to share more details if required. Thanks for the help.


Loading the engine reads a serialized engine file from disk into memory.
To save the deserialization time, you can pre-load the engine, as you have already done.


In my Docker image, I tried pre-loading the engine, but if I push the image to the container registry and pull it on a new Orin with these steps:

  1. remove all the docker images on the ORIN.
  2. Pull the image.

then it again takes the same time as loading the serialized engine from scratch. Am I missing something?

Hi @cryptonyte, you can either run your docker CI pipeline on a Jetson device and do the TensorRT engine building inside your Dockerfile (I have jetson-containers set up in a similar way, with self-hosted Jetson workflow runners on GitHub), or you can mount a volume in your container at runtime and save the serialized TensorRT engine to the mounted volume, so that it persists on your device across container restarts.
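The volume-mount option could be sketched as an entrypoint wrapper along these lines; the cache path, engine name, and the commented build command are placeholders, not taken from this thread:

```shell
#!/bin/sh
# Entrypoint sketch: keep the serialized TensorRT engine on a mounted
# volume so it survives container restarts. Paths are placeholders.
set -eu

CACHE_DIR="${ENGINE_CACHE:-./engine-cache}"  # point this at the mounted volume
ENGINE="$CACHE_DIR/model.plan"
mkdir -p "$CACHE_DIR"

if [ -f "$ENGINE" ]; then
    echo "reusing cached engine: $ENGINE"
else
    echo "no cached engine found, building once"
    # The real build step would be something like:
    #   trtexec --onnx=/models/model.onnx --saveEngine="$ENGINE"
    : > "$ENGINE"  # stand-in so this sketch is runnable anywhere
fi
```

At runtime you would mount a host directory over the cache location, e.g. `docker run -v /data/engines:/models/cache -e ENGINE_CACHE=/models/cache ...`, so the engine built on the first start persists on the device.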

Note that if you go with the first way, make sure you have your default docker runtime set to nvidia, so that the GPU can be used during ‘docker build’ commands:
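For reference, on JetPack that setting goes in /etc/docker/daemon.json; adding the `default-runtime` key to the stock nvidia-container-runtime entry typically looks like this:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
```

Restart the daemon afterwards (`sudo systemctl restart docker`) so that `docker build` picks up the nvidia runtime.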

Thanks @dusty_nv. My CI uses gitlab-runner on an Orin. Engine-file generation is not integrated into the pipeline; the engine files are generated separately on another Orin. Option 1 fits my case best. Thanks for the suggestions, I will work on that and reach out if I encounter issues.
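For anyone following along, the engine-build step inside the Dockerfile (Option 1) might look roughly like this; the base image tag, model paths, and trtexec flags are illustrative assumptions, not taken from this thread:

```dockerfile
# Base image tag and paths are hypothetical; use the L4T/TensorRT image
# matching your JetPack release.
FROM nvcr.io/nvidia/l4t-tensorrt:r8.5.2-runtime

COPY model.onnx /models/model.onnx

# Build the serialized engine at image-build time, on the target GPU.
# Requires the default docker runtime to be nvidia (as dusty_nv noted)
# so the GPU is visible during `docker build`.
RUN /usr/src/tensorrt/bin/trtexec \
        --onnx=/models/model.onnx \
        --saveEngine=/models/model.plan
```

Note that TensorRT engines are specific to the GPU and TensorRT version they were built with, so the build should run on the same device class (here, an Orin) and JetPack release as the deployment target.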

Hi cryptonyte. We have a similar situation in one of our applications. I think it is related to the CUDA runtime, because it only happens in containers that use the CUDA runtime. In my case, I see memory filling up to some point, and processing only starts after that memory increase. It is not the inference engine, because there is no inference happening in that container and the memory in question is on the CPU. I haven’t tried to solve it; I think it is just how it was designed.

Hi @sarperyurttas36, yes, it will fill up memory, because loading the engine file reads it from disk and brings it into memory. So if we could somehow keep those memory buffers intact inside the Docker image, the application should be able to reuse them when running the full inference pipeline. That was my assumption when I tried the docker commit step manually. It does work; I just want to automate everything in CI.

Hi @cryptonyte, I don’t think I explained myself correctly. First, I don’t use an inference engine; nevertheless, I still wait (more than 2-3 seconds) for the first CUDA operation of a Docker container. My assumption is that it is loading some libraries from the host into the container. In other words, whenever you spin up a container and run the application, it needs to load the CUDA runtime; when you commit the container, you get an image that already contains the loaded CUDA runtime. You cannot get the same result without docker commit.

I have another application where I do inference with TensorRT, and I don’t have this issue per se. So please check whether your container utilizes the CUDA runtime; if so, it is normal to wait until it loads, in my experience. If you are only using the TensorRT runtime, I assume you are using a huge model that takes time to load into memory. I used YOLO models before (v6-v7-v8) and their deserialization time was negligible.
