In our TensorRT application we measure an execution time that covers input transfer, GPU processing, and output transfer. From a few NVIDIA references and blog posts we understood that "inference time" actually refers to latency.
Could you please guide us on how to measure latency and system throughput for our application?
A profiler implementation can be found in sampleGoogleNet.cpp.
The reportLayerTime() function is called once per layer.
It measures only the inference time of each layer; it does not include data-preparation time such as cudaMemcpy.
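A minimal sketch of what such a per-layer profiler looks like. The `IProfiler` stand-in below mirrors the shape of TensorRT's profiler interface (the real one is declared in NvInfer.h and attached via the execution context); it is reproduced here only so the snippet is self-contained, and the accumulation logic in `LayerProfiler` is an illustrative assumption, not the sample's exact code:

```cpp
#include <map>
#include <string>

// Stand-in mirroring the shape of TensorRT's profiler interface
// (the real nvinfer1::IProfiler lives in NvInfer.h). The runtime calls
// reportLayerTime() once per layer per inference.
struct IProfiler
{
    virtual void reportLayerTime(const char* layerName, float ms) = 0;
    virtual ~IProfiler() = default;
};

// Accumulates GPU time per layer. Note: this covers layer execution only,
// not host<->device copies (cudaMemcpy) or pre/post-processing.
struct LayerProfiler : IProfiler
{
    std::map<std::string, float> totalMs;

    void reportLayerTime(const char* layerName, float ms) override
    {
        totalMs[layerName] += ms;
    }

    // Sum of all per-layer times: the pure inference portion of latency.
    float sumMs() const
    {
        float s = 0.f;
        for (const auto& kv : totalMs) s += kv.second;
        return s;
    }
};
```

Summing the reported per-layer times gives you the inference component; the transfer and processing time around it must be timed separately.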
To calculate latency, sum the execution times of pre-processing, inference, and post-processing. Throughput is then the number of samples processed per unit of wall-clock time.