TensorRT enqueueV2/enqueue nvinfer1::IProfiler vs. Boost C++ chrono measured time comparison

Description

Measuring the inference time with the TensorRT nvinfer1::IProfiler interface reports ~20 ms less than the time measured with Boost C++ chrono timers wrapping the enqueueV2 API.

Environment

TensorRT Version: 7.1.3
GPU Type: GeForce RTX 2080 Ti
Nvidia Driver Version: R451.77
CUDA Version: 11.0
CUDNN Version: 8.0.1
Operating System + Version: Windows 10 Enterprise 2016 LTSB
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): 1.15.3
PyTorch Version (if applicable): 1.1.0

Relevant Files

If further information or material is required, please specify.

Steps To Reproduce

I have an ONNX model that was exported from PyTorch.
The model also includes two plugins that I implemented and successfully registered using the graphsurgeon tool.

I measure the inference time using two methods: the Boost C++ chrono library and the nvinfer1::IProfiler interface.

When I measure the TensorRT inference time using boost::chrono, it is done like this:

boost::chrono::high_resolution_clock::time_point tp1 = boost::chrono::high_resolution_clock::now();
m_context->enqueueV2(&m_modelBuffersDeviceAddr[0], m_stream, nullptr);
boost::chrono::high_resolution_clock::time_point tp2 = boost::chrono::high_resolution_clock::now();
boost::chrono::microseconds delta = boost::chrono::duration_cast<boost::chrono::microseconds>(tp2 - tp1);
std::cout << "TRT enqueueV2 time - " << delta.count() / 1000.0 << '\n';
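Note that enqueueV2 is an asynchronous call that only enqueues the inference work on m_stream before returning to the host. For reference, if the host-side timer should also cover completion of the enqueued GPU work, an explicit synchronization would be placed before the second timestamp, e.g. (this cudaStreamSynchronize call is illustrative and is not part of my measurement above):

m_context->enqueueV2(&m_modelBuffersDeviceAddr[0], m_stream, nullptr);
cudaStreamSynchronize(m_stream); // illustrative: block the host until all work enqueued on m_stream has finished
boost::chrono::high_resolution_clock::time_point tp2 = boost::chrono::high_resolution_clock::now();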

When I measure the TensorRT inference time using the nvinfer1::IProfiler interface, it is based on the SimpleProfiler struct provided by NVIDIA as part of the TensorRT toolkit:

virtual void reportLayerTime(const char* layerName, float ms)
{
    auto record = std::find_if(mProfile.begin(), mProfile.end(),
        [&](const Record& r) { return r.first == layerName; });
    if (record == mProfile.end())
        mProfile.push_back(std::make_pair(layerName, ms));
    else
        record->second += ms;
}

The reportLayerTime method is called by the TensorRT engine during the enqueueV2 process and stores the consumed time for each layer.
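For context, a minimal self-contained profiler built around this reportLayerTime method, attached to the execution context with IExecutionContext::setProfiler, would look roughly like the sketch below (the class name MyProfiler is illustrative; my actual code uses the SimpleProfiler struct mentioned above):

#include <NvInfer.h>
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Minimal IProfiler sketch: accumulates the per-layer times that TensorRT reports.
struct MyProfiler : public nvinfer1::IProfiler
{
    using Record = std::pair<std::string, float>;
    std::vector<Record> mProfile;

    void reportLayerTime(const char* layerName, float ms) override
    {
        auto record = std::find_if(mProfile.begin(), mProfile.end(),
            [&](const Record& r) { return r.first == layerName; });
        if (record == mProfile.end())
            mProfile.push_back(std::make_pair(layerName, ms));
        else
            record->second += ms;
    }
};

// Attached to the execution context before the measured run:
// MyProfiler profiler;
// m_context->setProfiler(&profiler); // TensorRT then calls reportLayerTime for every executed layer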

My expectation is that when I sum the consumed times of all layers, I will get the same result from both methods.
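The summation itself is straightforward, roughly like this (totalLayerTimeMs is just an illustrative name, using the profiler instance from the sketch above):

float totalLayerTimeMs = 0.0f; // sum of all per-layer times reported through reportLayerTime, in ms
for (const auto& record : profiler.mProfile)
    totalLayerTimeMs += record.second;
std::cout << "TRT profiler total time - " << totalLayerTimeMs << '\n';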

I have a specific model (ONNX-based) where I’m getting a ~20 ms gap between them.
For example, the Method #1 (boost::chrono) result is above 75 ms while the Method #2 (IProfiler) result is ~50 ms.

This gap is also reproduced on another PC with an older GPU (Quadro M2000M).
There the total consumed time is much higher (~7000 ms), but the gap between the two methods stays stable at ~20 ms.

For other models that I have (UFF- and ONNX-based, with and without plugins), I get the same results from both methods, as expected.

Please advise.

Hi @orong13,
Could you please share your script and model so that we can try to reproduce the issue on our end? It would also be helpful if you could share the profiler outputs.

Thanks!

Hello @AakankshaS ,

Sorry for the late response; it took me some time to get approval to share my material.

In the meantime I moved to TRT version 7.2.1 (everything else is unchanged).

I attached a zip file which contains the following:

  • Onnx model

  • Profiler output

Please explain exactly which script you need.

Thanks,

ToNVIDIA.zip (3.2 MB)