TensorRT enqueueV2/enqueue: nvinfer1::IProfiler vs. boost C++ chrono measured time comparison


Measuring inference time with the TensorRT nvinfer1::IProfiler interface reports ~20 ms less than the time measured with the boost C++ chrono timers wrapped around the enqueueV2 API.


TensorRT Version - 7.1.3
GPU Type - GeForce RTX 2080 Ti
Nvidia Driver Version - R451.77
CUDA Version - 11.0
CUDNN Version - 8.0.1
Operating System + Version - Windows 10 Enterprise 2016 LTSB
Python Version (if applicable) - 3.6.8
TensorFlow Version (if applicable) - 1.15.3
PyTorch Version (if applicable) - 1.1.0

Relevant Files

If further information or material is required, please specify.

Steps To Reproduce

I have an ONNX model that was exported from PyTorch.
The model also includes two plugins that I implemented and successfully registered using the graphsurgeon tool.

I measure the inference time using two methods: the boost C++ chrono library and the nvinfer1::IProfiler interface.

When I measure the TensorRT inference time using boost::chrono, it is done like this:

boost::chrono::high_resolution_clock::time_point tp1 = boost::chrono::high_resolution_clock::now();
m_context->enqueueV2(&m_modelBuffersDeviceAddr[0], m_stream, nullptr);
boost::chrono::high_resolution_clock::time_point tp2 = boost::chrono::high_resolution_clock::now();
boost::chrono::microseconds delta = boost::chrono::duration_cast<boost::chrono::microseconds>(tp2 - tp1);
std::cout << "TRT enqueueV2 time - " << delta.count() / 1000.0 << '\n';

When I measure the TensorRT inference time using the nvinfer1::IProfiler interface, it is based on the SimpleProfiler struct provided by NVIDIA as part of the TensorRT toolkit:

virtual void reportLayerTime(const char* layerName, float ms) override
{
    auto record = std::find_if(mProfile.begin(), mProfile.end(),
        [&](const Record& r) { return r.first == layerName; });
    if (record == mProfile.end())
        mProfile.push_back(std::make_pair(layerName, ms));
    else
        record->second += ms;
}

The reportLayerTime method is called by the TensorRT engine during the enqueueV2 process and stores, for each layer, its consumed time.

My expectation is that when I sum all the per-layer times, I will get the same result from both methods.

I have a specific model (ONNX based) where I’m getting a ~20 ms gap between them.
For example, Method #1 (boost::chrono) gives above 75 ms, while Method #2 (IProfiler) gives ~50 ms.

This gap is also reproduced on another PC with an older GPU - a Quadro M2000M.
There I’m getting much longer times, ~7000 ms, but the gap remains stable at ~20 ms between the two methods.

For the other models I have (UFF and ONNX based, with and without plugins), both methods give the same result, as expected.

Please advise.

Hi @orong13,
Could you please share the script and model so that we can try to reproduce this on our end? It would also be helpful if you could share the profiler outputs.


Hello @AakankshaS ,

Sorry for the late response,
It took me some time to get approval to share my material.

In the meantime I moved to TRT version 7.2.1 (everything else is unchanged).

I attached a zip file which contains the following:

  • Onnx model

  • Profiler output

Please explain exactly which script you need.


ToNVIDIA.zip (3.2 MB)