Should I build the TensorRT engine in the production environment?

We develop a C++ application that runs TensorRT inference on an AV (autonomous vehicle) with DRIVE Orin. The DriveOS version is 6.0.5.1. As we understand it, when a TensorRT engine is built from an ONNX model, the optimization is affected by the available hardware resources. The hardware resources on the AV when TensorRT inference is performed (several applications run TRT engines concurrently) are rather different from the environment in which the TensorRT engine is built in the office.

How can we get a TensorRT engine that is optimally tuned for the AV?

At present we use trtexec to convert the model in the office and then deliver the TRT engine to the AV. This might not be the best solution.

There is another way: on the AV, when the application runs for the first time, it loads the ONNX model, builds the TRT engine, performs calibration, and then saves the engine for later use. But building and calibrating can take a long time. Should we do it like this? And can we get a better-optimized TRT engine this way?
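For illustration, here is a minimal sketch of that build-on-first-run-and-cache idea, assuming the TensorRT 8.x C++ API shipped with DriveOS 6.x. The function name getOrBuildEngine and the paths model.onnx / model.engine are placeholders, and the INT8 calibration step is only hinted at in a comment:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <memory>
#include <vector>

using namespace nvinfer1;

// Minimal logger required by the TensorRT builder/runtime.
class Logger : public ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("[TRT] %s\n", msg);
    }
};

// Build the engine from ONNX on the first run and cache the serialized plan;
// subsequent runs just reload the cached file and skip the long build step.
std::vector<char> getOrBuildEngine(Logger& logger,
                                   const char* onnxPath,    // e.g. "model.onnx" (placeholder)
                                   const char* enginePath)  // e.g. "model.engine" (placeholder)
{
    // Fast path: a previously built engine already exists on the AV's filesystem.
    std::ifstream cached(enginePath, std::ios::binary);
    if (cached) {
        return std::vector<char>(std::istreambuf_iterator<char>(cached), {});
    }

    // Slow path: parse the ONNX model and build the engine on the target hardware.
    auto builder = std::unique_ptr<IBuilder>(createInferBuilder(logger));
    auto network = std::unique_ptr<INetworkDefinition>(builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
    auto parser  = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));
    parser->parseFromFile(onnxPath, static_cast<int>(ILogger::Severity::kWARNING));

    auto config = std::unique_ptr<IBuilderConfig>(builder->createBuilderConfig());
    // For INT8, set BuilderFlag::kINT8 and attach a calibrator via
    // config->setInt8Calibrator(...) here (omitted in this sketch).

    auto plan = std::unique_ptr<IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));

    // Cache the serialized plan so later startups only deserialize it.
    std::ofstream out(enginePath, std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());

    return std::vector<char>(static_cast<const char*>(plan->data()),
                             static_cast<const char*>(plan->data()) + plan->size());
}
```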

Thanks

Dear @tjliupeng,
You can use the TRT APIs or the trtexec tool to generate the TRT engine. Note that the trtexec application is itself built on the TRT APIs. You can generate the engine once and reuse (reload) it whenever you need it.
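The "generate once, reuse it" part corresponds to deserializing the cached plan with the TRT runtime. A minimal sketch of that reload path, again assuming the TensorRT 8.x C++ API; model.engine is a placeholder for wherever the serialized plan was delivered:

```cpp
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <memory>
#include <vector>

using namespace nvinfer1;

// Reload a previously generated engine instead of rebuilding it at every startup.
void loadAndPrepare(ILogger& logger)
{
    // Read the serialized plan produced earlier (on this device or delivered to it).
    std::ifstream file("model.engine", std::ios::binary);
    std::vector<char> plan(std::istreambuf_iterator<char>(file), {});

    // Keep the runtime alive at least as long as the engine it created.
    auto runtime = std::unique_ptr<IRuntime>(createInferRuntime(logger));
    auto engine  = std::unique_ptr<ICudaEngine>(
        runtime->deserializeCudaEngine(plan.data(), plan.size()));
    auto context = std::unique_ptr<IExecutionContext>(engine->createExecutionContext());

    // ... bind input/output device buffers and enqueue inference with the context ...
}
```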

So, the difference in available hardware resources doesn't matter? When I use trtexec in the office, no other TRT engine is running. But when the TRT engine runs on the AV, several applications are running TRT inference in parallel. The available GPU resources should be quite different.

Thanks.

Dear @tjliupeng,
You need to generate the TRT engine on the hardware where you want to use it (whether that is an x86 host or the Drive platform). You can run inference only when resources are available. If you want to run two inference models in parallel, you can schedule them on different resources. For example, the DRIVE AGX Orin platform has an iGPU and DLA engines, so you can generate a GPU or DLA TRT engine with trtexec or the TRT APIs to run the model on the specific engine. If you want to run two inference tasks on the same GPU, you can use a different cudaStream for each task (see the sketch below).
But note that launching two tasks in parallel does not mean they actually run in parallel: if the GPU is fully occupied by one task, the other task cannot run. You can verify whether the two GPU tasks overlap in an nsys trace.
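A minimal sketch of the two-streams idea, assuming TensorRT 8.x with the enqueueV2 binding-based API; the engines are assumed to be already deserialized and the device buffers already allocated (all names here are placeholders):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <memory>

// Two independent inference tasks on the same iGPU, each issued on its own CUDA stream.
void runConcurrently(nvinfer1::ICudaEngine& engineA, nvinfer1::ICudaEngine& engineB,
                     void** bindingsA, void** bindingsB)
{
    auto ctxA = std::unique_ptr<nvinfer1::IExecutionContext>(engineA.createExecutionContext());
    auto ctxB = std::unique_ptr<nvinfer1::IExecutionContext>(engineB.createExecutionContext());

    cudaStream_t streamA, streamB;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);

    // enqueueV2 is asynchronous: both launches are issued back to back, and the GPU
    // overlaps them only if enough SM resources are free (verify with an nsys trace).
    ctxA->enqueueV2(bindingsA, streamA, nullptr);
    ctxB->enqueueV2(bindingsB, streamB, nullptr);

    cudaStreamSynchronize(streamA);
    cudaStreamSynchronize(streamB);

    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
}
```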

I did not understand your test. On what platform do you intend to run your model? Does the testing platform already have any other GPU/DLA workload?

