Model deployment latency measurement

Description

I am using the trtexec sample from TensorRT to measure the latency of a UFF model, using a serialized engine, on a Jetson Xavier.

After reading the code (and as expected), I can see that the latency reported by this sample is the time from when an image is fed to the network until the output results are ready.

What I also want is the setup time for the serialized engine. I know this is a one-time cost, but I want to know the latency of loading a serialized engine and setting everything up, prior to running the first inference.

Any ideas on how I can do that without reading through the whole code, to save time?

Thanks,

Environment

TensorRT Version: 7.1
GPU Type: 512-Core Volta GPU with Tensor Cores
Nvidia Driver Version:
CUDA Version: 10.2
CUDNN Version: 8.0
Operating System + Version: Jetpack 4.4
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

You can wrap whatever code you are interested in timing (engine loading, pre-processing, etc.) between two timestamps and take the difference: time.time() in Python, or the chrono library in C++.
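As a minimal sketch of that idea with the TensorRT Python API (the engine file name "model.engine" is just a placeholder for your own serialized engine):

```python
import time
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ENGINE_PATH = "model.engine"  # placeholder: path to your serialized engine

# Timestamp before deserializing the engine and creating the execution context
start = time.time()

with open(ENGINE_PATH, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Timestamp after setup; the difference is the one-time load/setup latency
setup_time = time.time() - start
print("Engine load + context setup took {:.3f} s".format(setup_time))
```

The same pattern applies in C++: take a std::chrono::steady_clock timestamp before deserializeCudaEngine()/createExecutionContext() and one after, and report the difference.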
Or you can use something like Triton Inference Server, which I believe records some health/model metrics, such as model load time, inference request times, averages, etc.

Thanks

Hello SunilJB,

Thanks for your answer. I used chrono and it helped.