Description
Measuring the inference time with the TensorRT nvinfer1::IProfiler interface reports ~20 ms less than the time measured with Boost C++ chrono timers that wrap the enqueueV2 API call.
Environment
TensorRT Version: 7.1.3
GPU Type: GeForce RTX 2080 Ti
Nvidia Driver Version: R451.77
CUDA Version: 11.0
CUDNN Version: 8.0.1
Operating System + Version: Windows 10 Enterprise 2016 LTSB
Python Version (if applicable): 3.6.8
TensorFlow Version (if applicable): 1.15.3
PyTorch Version (if applicable): 1.1.0
Relevant Files
If further information or material is required, please specify.
Steps To Reproduce
I have an ONNX model that was exported from PyTorch.
The model also includes two plugins that I implemented and successfully registered using the graphsurgeon tool.
I measure the inference time using two methods: the Boost C++ chrono library and the nvinfer1::IProfiler interface.
When I measure the TensorRT inference time using boost::chrono, it is done like this:
boost::chrono::high_resolution_clock::time_point tp1 = boost::chrono::high_resolution_clock::now(); // timestamp before launching inference
m_context->enqueueV2(&m_modelBuffersDeviceAddr[0], m_stream, nullptr);
boost::chrono::high_resolution_clock::time_point tp2 = boost::chrono::high_resolution_clock::now(); // timestamp after the enqueueV2 call returns
boost::chrono::microseconds delta = boost::chrono::duration_cast<boost::chrono::microseconds>(tp2 - tp1);
std::cout << "TRT enqueueV2 time - " << delta.count() / 1000.0 << '\n'; // microseconds converted to milliseconds
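For reference, a variant of the same measurement that explicitly waits for the stream before taking the second timestamp would look roughly like the sketch below (for comparison only; the cudaStreamSynchronize call from the CUDA runtime is not part of the snippet above):
boost::chrono::high_resolution_clock::time_point tp1 = boost::chrono::high_resolution_clock::now();
m_context->enqueueV2(&m_modelBuffersDeviceAddr[0], m_stream, nullptr);
cudaStreamSynchronize(m_stream); // block until all work queued on m_stream has finished
boost::chrono::high_resolution_clock::time_point tp2 = boost::chrono::high_resolution_clock::now();
std::cout << "TRT enqueueV2 + sync time - " << boost::chrono::duration_cast<boost::chrono::microseconds>(tp2 - tp1).count() / 1000.0 << '\n';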
When I measure the TensorRT inference time using the nvinfer1::IProfiler interface, it is based on the SimpleProfiler struct that is provided by NVIDIA as part of the TensorRT toolkit:
virtual void reportLayerTime(const char* layerName, float ms)
{
    // Called by TensorRT once per layer during execution; accumulate the time under the layer's name
    auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r) { return r.first == layerName; });
    if (record == mProfile.end())
        mProfile.push_back(std::make_pair(layerName, ms));
    else
        record->second += ms;
}
The reportLayerTime method is called by the TensorRT engine during the enqueueV2 process and stores the consumed time for each layer.
My expectation is that when I sum all of the layers' consumed times, I will get the same result from both methods.
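For context, this is roughly how the profiler is wired up and how the per-layer times are totalled (a simplified sketch; the SimpleProfiler construction and the direct access to mProfile are illustrative and depend on the exact SimpleProfiler variant, while setProfiler and enqueueV2 are the actual TensorRT calls):
SimpleProfiler profiler("layer times"); // construction details depend on the SimpleProfiler variant in use
m_context->setProfiler(&profiler); // TensorRT will call reportLayerTime() for each layer during enqueueV2
m_context->enqueueV2(&m_modelBuffersDeviceAddr[0], m_stream, nullptr);
float totalMs = 0.0f; // sum of all per-layer times collected by reportLayerTime
for (const auto& record : profiler.mProfile) // assumes mProfile (the vector shown above) is accessible here
    totalMs += record.second;
std::cout << "TRT profiler total time - " << totalMs << '\n';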
There is one specific model (ONNX based) for which I'm getting a ~20 ms gap between the two methods.
For example, Method #1 reports more than 75 ms while Method #2 reports ~50 ms.
The gap also reproduces on another PC with an older GPU - a Quadro M2000M.
There the measured times are much longer, ~7000 ms, but the gap between the two methods stays stable at ~20 ms.
For the other models that I have (UFF and ONNX based, with and without plugins), both methods give the same result, as expected.
Please advise.