Currently, in order to build a PLAN file with TensorRT, you need to do it on the target platform that you intend to run inference on.
This is a hassle in a number of ways, especially if you want to be able to test your model on different GPUs.
Would it be possible to add an option to TensorRT to be able to “cross-build” PLAN files? At the expense of suboptimal optimization, of course.
This would be similar to what NVCC can already do, where you tell it which compute capability (CC) you want to build your CUDA code for. That way you could, for example, build a PLAN file on your laptop and quickly copy it over to e.g. a DrivePX to run inference on it.
Unfortunately, that capability isn’t available at the moment.
However, consider the process you describe: building all engines on one device, copying the engine files over to the target devices, and then running inference on each device. Wouldn’t it take equal, if not less, time to copy the conversion script (or use trtexec) plus the model file to all devices and run build + inference there? In fact, you’d even be able to build multiple engines in parallel this way, whereas building all engines on one device would have to be done sequentially.
For example, if you’re already copying over a file, ssh’ing into each device, and then running inference, couldn’t you combine your build and inference steps into a wrapper bash script, copy over the model + script, and run that wrapper for build + inference on each device?
It seems like the same amount of work, and you don’t have the “expense of suboptimal optimization” either.
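For concreteness, here is a minimal sketch of such a wrapper, written in Python for illustration rather than bash. It assumes trtexec is on the PATH of each device, that the model is an ONNX file, and that run_inference.py is a hypothetical inference script that loads the resulting PLAN file; adapt the commands to your actual build and inference steps.

```python
# Minimal sketch of the suggested build + inference wrapper (Python instead
# of bash). Assumes trtexec is on PATH; "run_inference.py" is a hypothetical
# inference script that loads the generated PLAN file.
import subprocess
import sys

def build_and_run(onnx_path: str, plan_path: str) -> None:
    # Build the engine on the target device itself, so no cross-build is needed.
    subprocess.run(
        ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={plan_path}"],
        check=True,
    )
    # Run inference immediately with the freshly built engine.
    subprocess.run(
        [sys.executable, "run_inference.py", "--engine", plan_path],
        check=True,
    )

if __name__ == "__main__":
    build_and_run("model.onnx", "model.plan")
```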
That’s correct, I could make a script that does the building plus the inference. There are a few drawbacks, though:
- We need to deploy temporary calibration caches that are not part of the final application.
- It currently takes around 15 minutes per network to generate an INT8 PLAN file on the DDPX, even when using the calibration cache (see the sketch after this list). This is not the case when building PLAN files on an x86 computer; I blame the slow ARM cores of the DDPX. With a few networks to deploy, the total build time can reach a few hours, which is not very practical.
- We want to be able to deploy by simply copying over only the necessary built files that compose the application, and then immediately run the application without having to wait for anything else. Building can be done at any other time, perhaps even overnight.
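To illustrate the calibration-cache point: the only INT8 artifact needed at build time is the cache itself, which can be replayed through a calibrator that never touches calibration data. A rough sketch using the TensorRT Python API follows; the class name and cache file name are made up for illustration.

```python
# Rough sketch (TensorRT Python API; names are illustrative): an INT8
# calibrator that only replays a pre-generated calibration cache, so the
# calibration dataset itself never has to be copied to the target; only
# the small cache file does.
import tensorrt as trt

class CacheOnlyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, cache_path: str = "calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.cache_path = cache_path

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        # No calibration data on the target: returning None means TensorRT
        # must rely entirely on the cache supplied below.
        return None

    def read_calibration_cache(self):
        # TensorRT checks the cache first; a valid cache skips calibration batches.
        with open(self.cache_path, "rb") as f:
            return f.read()

    def write_calibration_cache(self, cache):
        # The cache already exists and is treated as read-only here.
        pass

# Plugged into a builder config, e.g.:
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = CacheOnlyCalibrator("calib.cache")
```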
In general, I think it would be good to give the user an option to “not run optimization” and instead use “a reasonably fast” implementation for each layer. Similar to NVCC, a “PLAN file compiler” could choose reasonably fast implementations for each requested CC.
This would also be beneficial in terms of reproducibility; right now PLAN files are not reproducible, since building them depends on the state of the target they are built on, which is not deterministic. Being able to choose a “reproducible, but not the fastest possible” implementation is very valuable in my opinion.
To be honest, I actually see some flaws in the optimization process: you optimize based on the “current” state of your target machine, but then you might run the engine on another target machine (of the same architecture) that is in a “different” state.
Example: you have a DDPX in a build farm and a DDPX in the car. The DDPX in the build farm has no other processes running on it, and you build a PLAN file there. Then you copy that PLAN file over to the DDPX in the car, where other processes are running. The state is therefore different, and perhaps the optimization that was done in the build farm no longer holds when you run it in the car?
I realize this is an old topic, but I wanted to second the request for cross-building TensorRT plan files. I have a Jetson Nano 2GB, and the ONNX->TensorRT build step runs out of memory and gets killed. It would be great to be able to build the plan file on a machine with more RAM and faster CPUs and then just transfer it to the target for inference.