What are the best practices to get maximum performance from TensorRT?

This question is specifically targeted at the TensorRT dev team and at users with experience optimizing and deploying TensorRT engines/applications in a production environment.

Is creating the network with the network definition API and importing weights from a trained model the best way to get the most out of TensorRT? If not, what are some best practices beyond what’s mentioned in the documentation?

EDIT:
Environment: Windows
Language: C++
DL Framework: PyTorch/TF
Hardware: GTX1080Ti/RTX2080Ti

How you create the network definition (either manually or through one of our parsers) doesn’t matter in terms of performance. The main factors that affect performance are the precision you run TensorRT in and the GPU you have.
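
To illustrate the precision point, here is a minimal C++ sketch, assuming TensorRT 7 with the ONNX parser; the model path ("model.onnx") and the 1 GiB workspace size are placeholders, not part of the original answer. It parses an ONNX model exported from PyTorch/TF and enables FP16 only when the GPU reports fast FP16 support:

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <iostream>

// Minimal logger required by the TensorRT API.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    // Build phase: create the network definition by parsing an ONNX model.
    auto builder = nvinfer1::createInferBuilder(gLogger);
    const auto explicitBatch =
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = builder->createNetworkV2(explicitBatch);
    auto parser  = nvonnxparser::createParser(*network, gLogger);
    parser->parseFromFile("model.onnx",
                          static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = builder->createBuilderConfig();
    config->setMaxWorkspaceSize(1ULL << 30); // scratch space for tactic selection

    // Reduced precision is the main performance lever; enable FP16 only if
    // the GPU supports it efficiently.
    if (builder->platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);

    auto engine = builder->buildEngineWithConfig(*network, *config);
    if (!engine)
    {
        std::cerr << "Engine build failed" << std::endl;
        return 1;
    }

    // ... serialize the engine, create an execution context, run inference ...

    engine->destroy();
    config->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}
```

On a GTX 1080 Ti (Pascal, no Tensor Cores) FP16 typically gives little benefit, while an RTX 2080 Ti usually sees a substantial speedup from it; INT8 can go further if the accuracy loss is acceptable, but requires calibration.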

Please refer to the link below for other best practices:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-best-practices/index.html

Thanks