That’s correct, I could make a script that does the building plus the inference. There are a few drawbacks, though:
- We need to deploy temporary calibration caches that are not part of the final application.
- It currently takes around 15 minutes per network to generate an INT8 PLAN file on DDPX (even with the calibration cache). This is not the case when building PLAN files on an x86 computer - I blame the slow ARM cores of the DDPX. With several networks to deploy, the total build time can reach a few hours, which is not very practical.
We want to be able to deploy by simply copying over only the built files that make up the application, and then run the application immediately without waiting for anything else. Building can happen at any other time, perhaps even overnight.
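A sketch of the workflow I have in mind, assuming an ONNX model and the `trtexec` tool that ships with TensorRT (file names and the target hostname are placeholders):

```shell
# On the build machine (can run overnight): build the INT8 PLAN file
# from the previously generated calibration cache, without running inference.
trtexec --onnx=model.onnx \
        --int8 \
        --calib=calibration.cache \
        --saveEngine=model.plan \
        --buildOnly

# Deploy: copy only the built PLAN file to the target...
scp model.plan ddpx-car:/opt/app/model.plan

# ...where the application loads it and runs immediately -
# no build step and no calibration cache needed on the target.
trtexec --loadEngine=model.plan
```

Note that a PLAN file built this way is still tied to the GPU architecture and TensorRT version, so builder and target must match - which is exactly the case here (DDPX to DDPX).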
In general I think it would be good to give the user an option to “not run optimization” and instead use “a reasonably fast” implementation for each layer. Similar to NVCC, a “PLAN file compiler” could choose reasonably fast implementations for each requested compute capability (CC).
This would also be beneficial in terms of reproducibility: right now PLAN files are not reproducible, since building them depends on the state of the target they are built on, which is not deterministic. Being able to choose a “reproducible, but not the fastest possible” implementation is very valuable in my opinion.
To be honest, I actually see a flaw in the optimization process itself - you optimize based on the “current” state of your target machine, but you might then run the result on another target machine (same architecture) that is in a “different” state.
Example: you have a DDPX in a build farm and a DDPX in the car. The DDPX in the build farm has no other processes running on it, and you build a PLAN file there. You then copy that PLAN file to the DDPX in the car, where other processes are running. The state is different, so the optimization choices made in the build farm may no longer be the best ones when you run in the car.
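The issue can be seen with a toy, hypothetical autotuner (pure Python, not related to TensorRT’s actual implementation): it times candidate implementations on the current machine and keeps the fastest, so its choice depends on whatever else the machine happens to be doing at build time.

```python
import time

def autotune(candidates, runs=5):
    """Time each candidate and return the name of the fastest one.

    The winner depends on the machine's load while timing runs,
    so the result is not reproducible across machines or states.
    """
    best_name, best_time = None, float("inf")
    for name, fn in candidates:
        start = time.perf_counter()
        for _ in range(runs):
            fn()
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name

def kernel_a():
    # One way to compute the same result.
    return sum(i * i for i in range(10_000))

def kernel_b():
    # A functionally equivalent alternative implementation.
    return sum(map(lambda i: i * i, range(10_000)))

choice = autotune([("a", kernel_a), ("b", kernel_b)])
print(choice)
```

If another process steals CPU time while kernel `a` is being measured, `b` wins instead: same code, different machine state, different “optimized” result.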